@zigrivers/scaffold 3.22.0 → 3.23.0
- package/README.md +21 -7
- package/content/knowledge/data-science/README.md +23 -0
- package/content/knowledge/data-science/data-science-architecture.md +163 -0
- package/content/knowledge/data-science/data-science-conventions.md +233 -0
- package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
- package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
- package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
- package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
- package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
- package/content/knowledge/data-science/data-science-observability.md +161 -0
- package/content/knowledge/data-science/data-science-project-structure.md +178 -0
- package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
- package/content/knowledge/data-science/data-science-requirements.md +151 -0
- package/content/knowledge/data-science/data-science-security.md +151 -0
- package/content/knowledge/data-science/data-science-testing.md +183 -0
- package/content/knowledge/ml/README.md +10 -0
- package/content/methodology/data-science-overlay.yml +39 -0
- package/dist/config/schema.d.ts +672 -126
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +8 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +2 -2
- package/dist/config/schema.test.js.map +1 -1
- package/dist/config/validators/data-science.d.ts +4 -0
- package/dist/config/validators/data-science.d.ts.map +1 -0
- package/dist/config/validators/data-science.js +15 -0
- package/dist/config/validators/data-science.js.map +1 -0
- package/dist/config/validators/index.d.ts.map +1 -1
- package/dist/config/validators/index.js +2 -0
- package/dist/config/validators/index.js.map +1 -1
- package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
- package/dist/core/assembly/knowledge-loader.js +6 -0
- package/dist/core/assembly/knowledge-loader.js.map +1 -1
- package/dist/core/assembly/knowledge-loader.test.js +34 -0
- package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.js +73 -0
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/project/adopt.d.ts.map +1 -1
- package/dist/project/adopt.js +3 -1
- package/dist/project/adopt.js.map +1 -1
- package/dist/project/detectors/coverage.test.d.ts +2 -0
- package/dist/project/detectors/coverage.test.d.ts.map +1 -0
- package/dist/project/detectors/coverage.test.js +78 -0
- package/dist/project/detectors/coverage.test.js.map +1 -0
- package/dist/project/detectors/data-science.d.ts +4 -0
- package/dist/project/detectors/data-science.d.ts.map +1 -0
- package/dist/project/detectors/data-science.js +32 -0
- package/dist/project/detectors/data-science.js.map +1 -0
- package/dist/project/detectors/data-science.test.d.ts +2 -0
- package/dist/project/detectors/data-science.test.d.ts.map +1 -0
- package/dist/project/detectors/data-science.test.js +62 -0
- package/dist/project/detectors/data-science.test.js.map +1 -0
- package/dist/project/detectors/disambiguate.d.ts +2 -0
- package/dist/project/detectors/disambiguate.d.ts.map +1 -1
- package/dist/project/detectors/disambiguate.js +3 -2
- package/dist/project/detectors/disambiguate.js.map +1 -1
- package/dist/project/detectors/disambiguate.test.js +10 -1
- package/dist/project/detectors/disambiguate.test.js.map +1 -1
- package/dist/project/detectors/index.d.ts.map +1 -1
- package/dist/project/detectors/index.js +2 -0
- package/dist/project/detectors/index.js.map +1 -1
- package/dist/project/detectors/library.d.ts.map +1 -1
- package/dist/project/detectors/library.js +1 -0
- package/dist/project/detectors/library.js.map +1 -1
- package/dist/project/detectors/resolve-detection.test.js +31 -0
- package/dist/project/detectors/resolve-detection.test.js.map +1 -1
- package/dist/project/detectors/types.d.ts +6 -2
- package/dist/project/detectors/types.d.ts.map +1 -1
- package/dist/project/detectors/types.js.map +1 -1
- package/dist/types/config.d.ts +8 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/copy/core.d.ts.map +1 -1
- package/dist/wizard/copy/core.js +4 -0
- package/dist/wizard/copy/core.js.map +1 -1
- package/dist/wizard/copy/data-science.d.ts +3 -0
- package/dist/wizard/copy/data-science.d.ts.map +1 -0
- package/dist/wizard/copy/data-science.js +15 -0
- package/dist/wizard/copy/data-science.js.map +1 -0
- package/dist/wizard/copy/index.d.ts.map +1 -1
- package/dist/wizard/copy/index.js +2 -0
- package/dist/wizard/copy/index.js.map +1 -1
- package/dist/wizard/copy/types.d.ts +5 -1
- package/dist/wizard/copy/types.d.ts.map +1 -1
- package/dist/wizard/copy/types.test-d.js +7 -0
- package/dist/wizard/copy/types.test-d.js.map +1 -1
- package/dist/wizard/questions.d.ts +2 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +9 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +14 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +1 -0
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
@@ -0,0 +1,198 @@
---
name: data-science-data-versioning
description: When and how to version data for reproducibility — size-based rule for choosing between git+Parquet, git-lfs, and DVC
topics: [data-science, data-versioning, dvc, parquet, git-lfs]
---

If you can't answer "what data produced this result", you can't reproduce it.
A model trained on 2026-02-14's snapshot will drift from one trained today,
and without a versioning story you have no way to roll back, diff, or explain
the difference to a reviewer six months later.

Data versioning answers that question without blowing up the repo — the trick
is picking a tool proportional to the dataset size. The common failure mode
is over-engineering (wiring up DVC with a remote for a 40 MB CSV) or
under-engineering (committing 2 GB of Parquet directly into git and
discovering three months later that every clone takes twenty minutes).

## Summary

Pick your tool by size.

- Under ~1 GB of text or Parquet, plain git with committed Parquet files is fine.
- Between 1 and 10 GB, use `git-lfs` if you already use it on the team;
  otherwise invest in `DVC` (Data Version Control).
- Above 10 GB — or for any binary artifact (model weights, image corpora,
  audio) — always use DVC with a remote (s3, gcs, azure).
- Never version raw third-party data that you can't legally redistribute —
  store a fetch script and a content hash instead.

## Deep Guidance

### Size-based decision rule

| Dataset | Tool | Why |
|---------|------|-----|
| <1 GB text / Parquet | git + Parquet | Columnar compression keeps files small; git history stays sane |
| 1–10 GB (judgment call) | `git-lfs` if already adopted; otherwise DVC | LFS is lower-effort; DVC gives you pipeline stages too |
| >10 GB or binary artifacts | DVC with remote | Git history will not tolerate binary churn at this scale |
| Raw third-party data | Don't version — script + hash | Redistribution is often prohibited; raw bytes bloat history |

The sizes above are rules of thumb, not hard thresholds. What actually
matters is how often the data changes. A 5 GB file that you generate once
and never touch again can live in `git-lfs` forever without pain. The same
5 GB file regenerated weekly will accumulate 260 GB of LFS storage in a
year (52 weekly versions × 5 GB) — that's the point where DVC's
content-addressed cache starts to earn its complexity.

A second factor is team shape. A solo researcher on a laptop rarely needs a
remote backing store; a two-person team on different continents almost
always does. Choose the tool that fits the smallest real collaboration
pattern you have, not the one that scales to the team you imagine having.
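
The table and the weekly-regeneration arithmetic can be sketched in a few lines of Python. This is an illustration only: `recommend_tool`, `yearly_storage_gb`, and their parameters are hypothetical names, and the cut-offs are the rules of thumb from the table, not hard limits enforced by any tool.

```python
def recommend_tool(size_gb: float, *, binary_artifact: bool = False,
                   redistributable: bool = True,
                   team_uses_lfs: bool = False) -> str:
    """Map a dataset onto the size-based rule of thumb (hypothetical helper)."""
    if not redistributable:
        return "fetch script + content hash"
    if binary_artifact or size_gb > 10:
        return "DVC with remote"
    if size_gb >= 1:
        # The 1-10 GB band is a judgment call: reuse LFS if the team has it.
        return "git-lfs" if team_uses_lfs else "DVC"
    return "git + Parquet"


def yearly_storage_gb(size_gb: float, versions_per_year: int) -> float:
    """Worst-case storage if every regenerated version is kept in full."""
    return size_gb * versions_per_year


print(recommend_tool(0.04))                   # a 40 MB dataset
print(recommend_tool(5, team_uses_lfs=True))  # mid-size, LFS already adopted
print(yearly_storage_gb(5, 52))               # the weekly 5 GB example
```

Change frequency is deliberately kept as a separate function: the size rule picks a starting point, and the storage estimate tells you when to revisit it.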

### When git + Parquet is enough

For a solo or small-team project with modest data, commit processed Parquet directly. Keep raw data out of the repo; reserve git for cleaned, analysis-ready files.

```python
# src/pipelines/clean.py
import pandas as pd

df = pd.read_csv("data/raw/events.csv")  # data/raw/ is gitignored
# Parse timestamps on the filtered frame (lambda), not on the unfiltered df
clean = df.dropna(subset=["user_id"]).assign(ts=lambda d: pd.to_datetime(d["ts"]))
clean.to_parquet("data/interim/events_clean.parquet", compression="zstd")
```

```gitignore
# .gitignore
data/raw/
data/external/
*.csv
# do commit processed Parquet (gitignore comments must sit on their own line)
!data/interim/*.parquet
```

Parquet's columnar layout and zstd compression typically shrink tabular data
5–10x versus CSV. Diffs aren't line-level, but file-level content hashes are
stable, which is enough for "which version produced this model".

Pair the committed Parquet with a short data card — a markdown file at
`data/interim/events_clean.md` describing row count, schema, source, and the
commit that generated it — so readers of the repo a year later can tell what
they're looking at.
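
A data card can be generated rather than hand-written. The sketch below is one possible layout, not a standard: `render_data_card` and its fields are hypothetical, and the row count and schema passed in are placeholder values standing in for whatever your pipeline reports.

```python
def render_data_card(path: str, rows: int, schema: dict[str, str],
                     source: str, commit: str) -> str:
    """Return a small markdown data card (the layout is an assumption)."""
    lines = [
        f"# Data card: {path}",
        "",
        f"- Rows: {rows}",
        f"- Source: {source}",
        f"- Generated by commit: {commit}",
        "",
        "| Column | Type |",
        "|--------|------|",
    ]
    lines += [f"| {name} | {dtype} |" for name, dtype in schema.items()]
    return "\n".join(lines) + "\n"


card = render_data_card(
    "data/interim/events_clean.parquet",
    rows=1000,  # placeholder value
    schema={"user_id": "int64", "ts": "datetime64[ns]"},  # placeholder schema
    source="data/raw/events.csv",
    commit="<git sha>",
)
print(card)
```

Writing the returned string next to the Parquet file in the same commit keeps the card and the data from drifting apart.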

### DVC basics

DVC treats large files as pointers tracked in git. The real bytes live on a remote (s3/gcs/azure/ssh), and a small `.dvc` metadata file is committed.

```yaml
# dvc.yaml — pipeline stages with content-hashed inputs and outputs
stages:
  ingest:
    cmd: python src/ingest.py --out data/raw/events.parquet
    outs:
      - data/raw/events.parquet
  process:
    cmd: python src/process.py --in data/raw/events.parquet --out data/processed/features.parquet
    deps:
      - src/process.py
      - data/raw/events.parquet
    outs:
      - data/processed/features.parquet
```

Typical flow:

```bash
dvc init                          # creates .dvc/ directory
dvc remote add -d storage s3://my-bucket/dvc-store
dvc add data/raw/big_dataset.csv  # creates data/raw/big_dataset.csv.dvc (commit this)
dvc repro                         # runs stages whose inputs changed
dvc push                          # upload tracked files to remote
# dvc add writes its ignore rule next to the tracked file; the remote lives in .dvc/config
git add dvc.yaml dvc.lock data/raw/big_dataset.csv.dvc data/raw/.gitignore .dvc/config
git commit -m "track raw events via DVC"
```

`dvc.lock` records the content hash of every stage input and output, so
`dvc repro` on a peer's machine rebuilds exactly what you rebuilt. The
`.dvc/` directory holds local cache and config; the actual bytes never touch
git.

Mental model: git for code, DVC for data, both pointing at the same commit.
When you check out an older branch, git restores the source and the `.dvc`
pointers, and `dvc checkout` restores the matching data from the local cache
into your working tree. If those versions are not cached locally, run
`dvc pull` to fetch them from the remote first.

A common starting point: track one or two heavy inputs with `dvc add` (no
pipeline), and only adopt `dvc.yaml` stages once you have a repeatable
multi-step workflow. The overhead of stages pays off when you have 3+ steps
and want `dvc repro` to skip unchanged work; below that, plain `dvc add`
plus a Makefile is often clearer.

### git-lfs middle ground

If you're already using Git LFS on the team but not ready to adopt DVC, it works acceptably for the 1–10 GB band — especially for a handful of files over GitHub's 100 MB per-file limit.

```gitattributes
# .gitattributes
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
data/models/** filter=lfs diff=lfs merge=lfs -text
```

```bash
git lfs install
git lfs track "*.parquet"
git add .gitattributes data/features.parquet
git commit -m "add feature table via LFS"
```

Reach for `git-lfs` when files are over ~100 MB but you don't need DVC's
pipeline stages or content-addressed reproducibility. Skip it if you already
have DVC set up — two tools versioning the same data is a recipe for
confusion.

LFS has real drawbacks to know about: bandwidth is metered on hosted plans,
`git clone` downloads every LFS object referenced by the checked-out commit
by default (use `GIT_LFS_SKIP_SMUDGE=1` to defer), and you can't selectively
prune history without rewriting the whole repo. For a working group of 2–5
people on a research project these are usually tolerable; for a fleet of CI
workers cloning on every build they are not.

### What not to version

- **Third-party data with license constraints** — commit a fetch script (`scripts/fetch_kaggle.sh`) and record the SHA256 of the pulled file in a README. Re-download on each environment.
- **Regenerable intermediates** — if `dvc repro` or `make data` can recreate it deterministically from upstream inputs, don't commit the bytes.
- **Scratch / exploratory outputs** — `notebooks/scratch/`, `data/tmp/`, `.ipynb_checkpoints/` belong in `.gitignore`.
- **Anti-pattern: committing 500 MB Parquet files directly to git** — they live forever in history, clone times balloon, and nobody will clean it up later. Move to DVC or LFS *before* the first large commit, not after. Rewriting history to extract large blobs (`git filter-repo`, BFG) is disruptive to every collaborator and should be a last resort.
- **Anti-pattern: versioning model checkpoints in git** — a single PyTorch checkpoint can be several hundred MB, and training runs produce dozens. Push them to DVC or an artifact store (MLflow, Weights & Biases) keyed by experiment run ID.
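
For the fetch-script-plus-hash approach, the verification step might look like the sketch below. It uses only the standard library; `sha256_of` and `verify_download` are hypothetical helper names, and the expected hash is whatever value you recorded in the README.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Raise if the fetched file does not match the recorded hash."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"{path}: expected {expected_sha256}, got {actual}")
```

Calling `verify_download` at the end of the fetch script turns a silently changed upstream file into an immediate failure instead of an irreproducible result.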

### Quick migration path

If you're staring at a repo that has already committed large files to plain
git, the order of operations is:

1. Decide the target tool (DVC for most cases where you got here).
2. Stop tracking the file in git with `git rm --cached <file>`, then run
   `dvc add` on it in its current location. DVC refuses to add a file that
   git still tracks; once added, it creates a `.dvc` pointer and a
   `.gitignore` entry.
3. Commit the pointer and the updated `.gitignore`.
4. Optionally run `git filter-repo` to purge the old blobs from history if
   clone size has become painful.

Step 4 requires coordination — everyone must re-clone — so defer it until
the pain justifies the disruption.

### Reproducibility in practice

The goal of all of this is a single concrete question: given a git commit,
can a teammate rebuild the exact model artifact that the commit describes?
Answer yes by pinning three things together:

- **Code** — the git commit itself.
- **Data** — a `.dvc` pointer, an LFS object, or a committed Parquet file,
  all content-hashed.
- **Environment** — a pinned `requirements.txt`, `pyproject.toml`, or
  `conda-lock.yml` committed in the same commit.

If any one of those three is missing, reproducibility is accidental. The
versioning tool you pick is less important than treating the three as a
single atomic unit — changed together, reviewed together, reverted together.
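
One way to make the "single atomic unit" idea tangible is to record the three pins side by side and compare them between runs. The sketch below is an assumption-laden illustration: `RunPins`, its field names, and `changed_pins` are hypothetical, and the values are placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunPins:
    """The three things pinned together (field names are illustrative)."""
    code_commit: str    # git commit SHA
    data_hash: str      # .dvc hash, LFS oid, or hash of the committed Parquet
    env_lock_hash: str  # hash of the committed lockfile


def changed_pins(old: RunPins, new: RunPins) -> list[str]:
    """List which pins differ between two runs."""
    before, after = asdict(old), asdict(new)
    return [name for name in before if before[name] != after[name]]


a = RunPins("abc123", "d41d8c", "lock-v1")  # placeholder values
b = RunPins("def456", "d41d8c", "lock-v1")
print(changed_pins(a, b))
```

If two runs disagree on a metric, the list of changed pins is the first place to look: an empty list with different results points at nondeterminism rather than at code, data, or environment.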

@@ -0,0 +1,159 @@
---
name: data-science-dev-environment
description: Reproducible local Python dev environment for data science using uv, direnv, pre-commit, and pyproject.toml
topics: [data-science, dev-environment, uv, direnv, pre-commit]
---

A data-science project that cannot be recreated in minutes is a liability. Notebooks pick up stale package versions, secrets leak into `.bashrc`, and "works on my machine" kills any chance of a collaborator (or future-you) rerunning an experiment. The fix is not complicated: one lockfile, one place for env vars, one pre-commit hook, no bespoke shell scripts. This guide is opinionated toward solo and small-team workflows where local-first beats container-first.

## Summary

Use `uv` as the single Python package manager — it replaces `pip`, `pip-tools`, `venv`, and `virtualenv` with one fast, reproducible tool. Declare every dependency in `pyproject.toml` so there is exactly one source of truth, and commit `uv.lock` so `uv sync` gives any collaborator the same resolved package versions. Layer `direnv` on top for per-repo environment variables (tracking URIs, data paths, secrets pulled from a vault) so nothing leaks into your global shell. Add `pre-commit` with a small set of fast hooks (`ruff-format`, the `ruff` linter, end-of-file fixer) so style and obvious bugs never enter a commit. Skip Docker for greenfield solo DS work — reach for it only when you cross an OS boundary (Mac dev, Linux prod) or depend on GPU/CUDA libraries.

## Deep Guidance

### uv for Python environment and dependencies

`uv` is the 2025 default for Python packaging. It is a drop-in replacement for `pip`, `venv`, and `pip-tools`, written in Rust, and roughly 10-100x faster than the tools it replaces. For data science the combination of `uv sync` (reproduces the environment from the lockfile) and `uv run` (executes a script in the managed venv without activation) is the whole workflow.

Bootstrap a new project:

```bash
uv init --python 3.12 myproject
cd myproject
uv add pandas numpy scikit-learn jupyterlab pandera
uv add --dev ruff pytest pre-commit
uv sync               # creates .venv and installs everything
uv run pytest         # runs in the managed venv, no activation needed
uv run jupyter lab
```

A minimal `pyproject.toml`:

```toml
[project]
name = "myproject"
version = "0.1.0"
description = "Customer churn analysis"
requires-python = ">=3.12"
dependencies = [
    "pandas>=2.2",
    "numpy>=2.0",
    "scikit-learn>=1.5",
    "jupyterlab>=4.2",
    "pandera>=0.20",
]

[dependency-groups]
dev = [
    "ruff>=0.6",
    "pytest>=8.0",
    "pre-commit>=3.8",
]

[tool.ruff]
line-length = 100
target-version = "py312"

[tool.ruff.lint]
select = ["E", "F", "I", "B", "UP", "PD"]  # PD = pandas-vet
```

Commit both `pyproject.toml` and `uv.lock`. A collaborator clones the repo and runs `uv sync` — that is the entire setup step. No `pip install -r requirements.txt`, no virtualenv activation, no version drift.

Add a package with `uv add <name>`; remove with `uv remove <name>`. Both edit `pyproject.toml` and update `uv.lock` atomically. To pin a version use `uv add "pandas==2.2.3"`. To upgrade run `uv lock --upgrade-package pandas`.

Two `uv` features worth knowing for data science specifically:

- **`uv run script.py`** executes a file in the project venv with no activation step. Wire this into a `Makefile` or `justfile` so `make train` and `make eval` Just Work for any collaborator.
- **Inline script metadata (PEP 723).** For one-off analysis scripts that live outside the project, a comment-block header at the top of the file declares dependencies and `uv run` auto-creates an ephemeral venv:

  ```python
  # /// script
  # requires-python = ">=3.12"
  # dependencies = ["pandas", "duckdb"]
  # ///
  import pandas as pd
  import duckdb
  ...
  ```

Running `uv run oneoff.py` resolves and caches those deps transparently. No more "should I add this to `pyproject.toml`?" for throwaway exploration.

### direnv for env vars

`direnv` loads a per-directory `.envrc` file whenever you `cd` into the project. It keeps secrets and tracking URIs out of your global shell and ensures every terminal session sees the same variables. Skip it if your project has no environment variables; add it the first time you reach for one.

Install once (`brew install direnv`, then hook into your shell per the docs). In the project:

```bash
# .envrc — commit this file
# Put the uv-managed venv first on PATH (direnv's stdlib has no `use python`)
export VIRTUAL_ENV="$PWD/.venv"
PATH_add .venv/bin
source_up_if_exists              # inherit vars from parent .envrc if present

# Experiment tracking
export MLFLOW_TRACKING_URI="http://localhost:5000"

# Data paths (relative to repo root)
export DATA_DIR="$PWD/data"
export MODELS_DIR="$PWD/models"

# Make imports work without installing the package
export PYTHONPATH="$PWD/src:$PYTHONPATH"

# Secrets — source from a local-only file, never commit
[[ -f .envrc.local ]] && source_env .envrc.local
```

Add `.envrc.local` to `.gitignore` and put any actual secrets there (API keys, DB passwords). Run `direnv allow` once after creating or editing `.envrc`; `direnv` will refuse to load until you do. The moment you `cd` out of the project, all variables unload — no pollution.
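
On the Python side, it helps to read these variables in one place and fail loudly when `direnv` has not loaded them. A minimal sketch, assuming the `DATA_DIR` and `MODELS_DIR` names from the `.envrc` above; `project_paths` itself is a hypothetical helper:

```python
import os
from collections.abc import Mapping
from pathlib import Path


def project_paths(env: Mapping[str, str] = os.environ) -> dict[str, Path]:
    """Resolve DATA_DIR and MODELS_DIR, failing fast if they are unset."""
    missing = [name for name in ("DATA_DIR", "MODELS_DIR") if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"missing env vars {missing}; run `direnv allow` in the repo root"
        )
    return {"data": Path(env["DATA_DIR"]), "models": Path(env["MODELS_DIR"])}
```

Passing the environment in as a mapping keeps the function testable without touching the real shell environment.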

### pre-commit hooks

`pre-commit` runs a configured set of checks every time you `git commit`. Keep the hook list short and fast — anything slower than a second or two trains you to use `--no-verify`, which defeats the point. For data science the right starter set is format, lint, and a couple of sanity hooks.

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff          # linter; newer releases also expose this hook as `ruff-check`
        args: [--fix]
      - id: ruff-format   # runs after the linter so auto-fixes get formatted

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-yaml
      - id: check-added-large-files
        args: [--maxkb=500]   # block accidental dataset commits

  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout   # strips notebook outputs before commit
```

Install the git hook once per clone:

```bash
uv run pre-commit install
uv run pre-commit run --all-files   # bootstrap: fix everything now
```

Deliberately excluded from this set: `mypy`, `pytest`, and `bandit`. They are all valuable, but they are slow enough that they belong in CI, not in the commit path. Fast local, thorough remote.

### When to add Docker

For greenfield solo data science, Docker is overhead you do not need. `uv sync` already gives you reproducibility on any machine with the same OS, and local iteration is faster without a container layer.

Reach for Docker only when one of these is true:

- **OS mismatch between dev and prod.** You develop on macOS but the model runs on Linux in production, and a native dependency (e.g. a C extension, a specific `libgomp`) behaves differently across platforms.
- **GPU / CUDA dependencies.** CUDA toolkit versions are tightly coupled to driver versions and OS. A pinned `nvidia/cuda` base image is the only sane way to guarantee training reproducibility across machines.
- **Handoff to MLOps or serving infra.** Production deployment targets (SageMaker, Vertex, KServe, plain Kubernetes) expect a container. Build one at the handoff boundary, not before.
- **Onboarding collaborators with hostile local setups.** A Windows colleague who cannot install `uv` natively is a reasonable reason to ship a devcontainer.

When you do add Docker, keep it thin: copy `pyproject.toml` and `uv.lock`, run `uv sync --frozen`, and let the same lockfile drive both local and container builds. That way the container is a packaging detail, not a parallel source of truth.

@@ -0,0 +1,194 @@
---
name: data-science-experiment-tracking
description: Local MLflow setup, run instrumentation, git commit tagging, and run comparison for solo and small-team data science work
topics: [data-science, experiment-tracking, mlflow, weights-and-biases, reproducibility]
---

Without experiment tracking, data science becomes archaeology: three weeks after a promising result, a stakeholder asks "which config produced that number?" and answering it turns into a forensic exercise — sifting through notebook history, Slack messages, and commented-out cells. A lightweight experiment tracker fixes this with one discipline: every run logs its hyperparameters, metrics, artifacts, and the git commit SHA that produced it. For a solo DS or small team, you do not need a shared server or a cloud account — a local MLflow instance on SQLite is enough to get the full benefit, and you can graduate to a shared deployment later without changing the instrumentation.

## Summary

Self-host MLflow locally with a SQLite backend and a local artifact directory — it is the minimum setup that still gives you a queryable run history, a browsable UI, and reproducible run IDs. Every run logs the full hyperparameter dict, metrics per epoch (or iteration), the git commit SHA as a tag, dataset version, and any config or report artifacts. Weights & Biases is a reasonable cloud alternative if you value the polished UI and do not mind cloud storage — but for a DS-1 setup it is not the primary recommendation. Never log PII into run metadata or artifacts, and never commit `mlflow.db` or `mlartifacts/` to git.

## Deep Guidance

### What to log per run

Treat every training run, hyperparameter tweak, or evaluation pass as a tracked experiment — even the exploratory ones you think will be throwaway. The cost of logging is trivial; the cost of not logging a run that turns out to matter is measured in hours of re-running and second-guessing. The minimum payload is:

- **Hyperparameters**: the full config dict (learning rate, batch size, seed, feature set, model type, loss weights, regularization). Log it all — future-you does not know which knob will matter and adding knobs retroactively is impossible.
- **Metrics**: logged with `step=epoch` (or `step=iteration`) so the UI can render a time-series plot. Log train and validation metrics side by side; a single final-value log loses the overfitting story.
- **Git commit SHA**: a tag pointing to the exact commit that produced the run. Without this, "reproduce run 47" is unanswerable, because the config alone does not capture code changes in the training loop, data loader, or feature engineering.
- **Dataset version**: a tag or param identifying which dataset snapshot was used — a DVC hash, a filename with a date suffix, or a data commit SHA. Without this, "reproduce run 47" is still unanswerable even if you have the code, because the data moved underneath it.
- **Run name**: a human-readable name (`baseline-v3-with-dropout`) so the UI list is browsable without clicking every row to read the params.
- **Artifacts**: the resolved config YAML, the evaluation report JSON, any confusion matrix images, and the final model checkpoint. Small artifacts go inline with the run; large model weights can be stored by reference.
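
The minimum payload can be enforced with a small pre-flight check before a run starts. This is a sketch under assumptions: the key names in `REQUIRED_RUN_FIELDS` and the helper `missing_run_fields` are hypothetical, chosen to mirror the list above rather than any MLflow API.

```python
REQUIRED_RUN_FIELDS = (
    "params", "metrics", "git_commit",
    "dataset_version", "run_name", "artifacts",
)


def missing_run_fields(payload: dict) -> list[str]:
    """Return required fields that are absent or empty in a run payload."""
    return [name for name in REQUIRED_RUN_FIELDS if not payload.get(name)]


payload = {
    "params": {"lr": 1e-3},
    "metrics": ["train_loss", "val_loss"],
    "git_commit": "<git sha>",          # placeholder
    "run_name": "baseline-v3-with-dropout",
}
print(missing_run_fields(payload))
```

Refusing to train when the returned list is non-empty is cheaper than discovering weeks later that a promising run cannot be tied to a dataset snapshot.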

### MLflow self-hosted setup

Run the tracking server locally. SQLite is the right backend for a solo workflow — it gives you the full query API without the ops burden of Postgres, and the `mlflow.db` file is small enough that you can zip and share it with a collaborator if you really need to:

```bash
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlartifacts \
  --host 127.0.0.1 --port 5000
```

Bind to `127.0.0.1` rather than `0.0.0.0` so you do not accidentally expose an unauthenticated tracking server to your network. Leave it running in a terminal tab, a `tmux` pane, or under `launchd`/`systemd` — whatever keeps it up between sessions.

Point your code at the server via an environment variable. Using `direnv` keeps this per-project and avoids polluting your shell:

```bash
# .envrc
export MLFLOW_TRACKING_URI=http://localhost:5000
export MLFLOW_EXPERIMENT_NAME=churn-baseline
```

Add the tracking artifacts to `.gitignore` — they are large, local, and not reproducible from source. Committing them bloats the repo and leaks local-path metadata into history:

```gitignore
# .gitignore
mlflow.db
mlflow.db-journal
mlartifacts/
mlruns/
```

When you later graduate to a shared MLflow server (team deployment, S3 artifact store, Postgres backend), the only change is the `MLFLOW_TRACKING_URI` — your instrumentation code stays identical, and historical runs stay on your laptop as a personal archive.
|
|
58
|
+
|
|
59
|
+
### Instrumenting a training / experiment run
|
|
60
|
+
|
|
61
|
+
Wrap the training loop in `mlflow.start_run`. The context manager handles start and end timestamps, guarantees the run closes even on exception, and exposes `run.info.run_id` — the stable handle you use later for comparison, export, or model loading:

```python
import subprocess

import mlflow
import yaml

# The tracking URI is read from MLFLOW_TRACKING_URI (see .envrc). Hard-coding it
# with mlflow.set_tracking_uri would override the environment variable.
mlflow.set_experiment("churn-baseline")


def train(cfg: dict) -> dict:
    # _flatten, train_epoch, and evaluate are project-specific helpers.
    with mlflow.start_run(run_name=cfg["experiment"]["name"]) as run:
        # Log full hyperparameter dict (flatten nested keys to dot-paths)
        mlflow.log_params(_flatten(cfg))

        # Reproducibility tags: git commit is the single most important one
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"]
        ).decode().strip()
        mlflow.set_tag("git_commit", git_sha)
        mlflow.set_tag("dataset_version", cfg["data"]["version"])
        mlflow.set_tag("model_type", cfg["model"]["type"])

        # Per-epoch metrics: step=epoch is what gives you a time-series plot
        for epoch in range(cfg["training"]["epochs"]):
            train_metrics = train_epoch(...)
            val_metrics = evaluate(...)
            mlflow.log_metrics({
                "train_loss": train_metrics["loss"],
                "val_loss": val_metrics["loss"],
                "val_auc": val_metrics["auc"],
            }, step=epoch)

        # Artifacts: resolved config + eval report
        with open("configs/resolved.yaml", "w") as f:
            yaml.safe_dump(cfg, f)
        mlflow.log_artifact("configs/resolved.yaml")
        mlflow.log_artifact("reports/eval_report.json")

        return {"run_id": run.info.run_id, **val_metrics}
```

A few notes on the shape of this code. `mlflow.log_params` takes a flat dict, so a helper like `_flatten` turns `{"optimizer": {"lr": 1e-3}}` into an `optimizer.lr` entry, which MLflow stores as the string `"0.001"` (all param values are coerced to strings). Log the **resolved** config after any CLI overrides or Hydra composition, not the raw file on disk, so the stored params match what actually ran. If the working tree is dirty at training time, either commit first or log `git status --porcelain` output as a tag so you can tell the logged commit is not the whole story. Keep the returned `run_id`: it is the primary key you will use to find this run in the UI, export its metadata, register its model later, or reference it from a downstream evaluation run via `mlflow.set_tag("parent_run_id", ...)`.

### Run comparison and selection

Open the MLflow UI at `http://localhost:5000`. The three views that earn their keep:


- **Run list** — sort by `metrics.val_auc` or filter by `tags.git_commit = "<sha>"`. Tag filters are the fastest way to find "the runs I launched from this branch." Sort by columns to see the run_id of your best-performing experiment, then click through for the full picture.
- **Parallel coordinates plot** — select several runs, switch to the parallel coordinates view, and see which hyperparameters correlate with your target metric. This is the view that turns dozens of runs into a readable pattern — hover a line to see the full config, drag axes to filter a band, and the plot re-paints to show only the runs that meet your criterion.
- **Metric plot** — overlay `val_loss` across selected runs to spot overfitting (train loss drops, val loss rises), bad seeds (wildly different trajectories with the same config), or early-stopping candidates (val metric plateaued ten epochs before training ended).
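
The early-stopping check described in the metric plot bullet can also be done numerically. The sketch below is an illustration under stated assumptions: `epochs_since_best` is a hypothetical helper, and it takes a plain list of per-epoch validation losses (which you might assemble from `MlflowClient.get_metric_history(run_id, "val_loss")`) rather than talking to MLflow itself.

```python
def epochs_since_best(val_losses: list[float], min_delta: float = 0.0) -> tuple[int, int]:
    """Return (best_epoch, wasted_epochs) for a per-epoch validation loss curve.

    An epoch counts as an improvement only if it beats the best loss so far
    by more than min_delta. wasted_epochs is how many epochs ran after the
    last improvement.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_epoch = 0
    best_loss = val_losses[0]
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_epoch, best_loss = epoch, loss
    return best_epoch, len(val_losses) - 1 - best_epoch


# Validation loss bottoms out at epoch 2, then drifts upward.
curve = [0.90, 0.70, 0.60, 0.61, 0.62, 0.65]
best_epoch, wasted = epochs_since_best(curve)
print(f"best epoch: {best_epoch}, epochs trained after it: {wasted}")
```

A large `wasted` value across several runs suggests the epoch budget in the config can be cut, or that an early-stopping callback is worth adding.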


You can also query programmatically when the UI's filters are not expressive enough:

```python
from mlflow.tracking import MlflowClient
import pandas as pd

client = MlflowClient()
runs = client.search_runs(
    experiment_ids=[client.get_experiment_by_name("churn-baseline").experiment_id],
    filter_string="metrics.val_auc > 0.82 and tags.dataset_version = '2026-03'",
    order_by=["metrics.val_auc DESC"],
    max_results=20,
)
df = pd.DataFrame([{
    "run_id": r.info.run_id,
    "name": r.info.run_name,
    "val_auc": r.data.metrics.get("val_auc"),
    "lr": r.data.params.get("optimizer.lr"),
} for r in runs])
```

When you have a winner, export its config back into the repo for a clean retrain:

```python
run = client.get_run("<run_id>")
client.download_artifacts(run.info.run_id, "resolved.yaml", dst_path="configs/")
```

Commit the exported config so "run 47's exact recipe" becomes a file in git, not a memory and not a database row that lives only on your laptop.

### Weights & Biases as alternative

Weights & Biases is the polished cloud alternative. It has a richer UI, built-in system metric logging (GPU, memory, temperature), gradient histograms via `wandb.watch(model, log="gradients")`, media logging (images, audio, tables, confusion matrices rendered inline), and better collaboration features — named reports, shared dashboards, thread-style comments. For a small team that has already decided it is comfortable with cloud storage, W&B removes the "who runs the server" question entirely, and the onboarding for a new teammate is `pip install wandb && wandb login`.


The instrumentation shape is familiar:

```python
import wandb

wandb.init(project="churn-baseline", name=cfg["experiment"]["name"],
           config=cfg, tags=["baseline", "v2-features"])
for epoch in range(cfg["training"]["epochs"]):
    # val_auc and loss come from your own training and evaluation step
    wandb.log({"epoch": epoch, "val/auc": val_auc, "train/loss": loss})
wandb.finish()
```

Two things to weigh before picking it over MLflow for a DS-1 setup. First, the free tier comes with limits on private projects, artifact retention, and seats: fine for a solo experimenter, but worth pricing out before a team commits. Second, you are shipping your experiment metadata (and potentially artifacts) to a third party, which matters if your dataset values, config parameters, or run names might accidentally encode something sensitive. MLflow self-hosted stays on your laptop; W&B lives in someone else's cloud. If your org has a data-residency or vendor-review process, a local MLflow setup avoids it entirely.

### Graduating from solo to shared

The path from "SQLite on my laptop" to "shared team tracker" is short and deliberately low-risk, because your instrumentation already speaks the MLflow protocol:


1. **Stand up a shared MLflow server** behind your internal network — Postgres backend, S3 or equivalent object store for artifacts, authentication in front (oauth2-proxy, nginx basic auth, or a cloud load balancer).
2. **Flip the tracking URI** in each project's `.envrc` to point at the shared server. No code changes.
3. **Optionally backfill historic runs** using `mlflow artifacts download` plus `search_runs`, then re-log to the shared server — only worth it for runs you want the team to see.
4. **Keep the local server configured** for offline or air-gapped work — you can still set `MLFLOW_TRACKING_URI=file:./mlruns` for quick local-only iteration when the shared server is down or you are on a plane.


This is why the self-hosted local setup is the right default even if you know you will eventually run a team server: the instrumentation you write today is exactly the instrumentation that will talk to production tomorrow.

### Nested runs for sweeps and evaluations

When you run a small hyperparameter sweep from a laptop — a few learning rates, two or three seeds — use MLflow's nested runs rather than one flat run per trial. A parent run captures the sweep-level config and best-metric summary; each trial becomes a child with its own params and metrics:

```python
with mlflow.start_run(run_name="lr-sweep") as parent:
    mlflow.log_param("sweep_type", "grid")
    best_auc = 0.0
    for lr in [1e-4, 3e-4, 1e-3, 3e-3]:
        with mlflow.start_run(run_name=f"lr={lr}", nested=True) as child:
            mlflow.log_param("lr", lr)
            val_auc = train_one(lr)
            mlflow.log_metric("val_auc", val_auc)
            best_auc = max(best_auc, val_auc)
    mlflow.log_metric("best_val_auc", best_auc)
```

In the UI, the parent shows an expandable tree of children, which keeps the run list navigable once you have hundreds of rows. For larger sweeps, pair MLflow with Optuna (`mlflow.start_run(nested=True)` inside an `objective` function) — you get Bayesian search on top of MLflow's persistence.

### Hygiene

**Never log PII.** Experiment metadata and artifacts are easy to share with collaborators, screenshot into a ticket, or accidentally export to a cloud UI when you later migrate to W&B or a shared MLflow. Hyperparameter values, metric names, run names, tags, and artifact contents must all be free of customer identifiers, emails, names, or raw records. If a config carries a data path, keep it at the dataset level (`data/processed/churn_2026_03.parquet`) — never at the row or user level (`data/users/alice@example.com/history.parquet`). Evaluation reports that include example predictions must redact any PII in the input columns before being logged as artifacts. See `data-science-security.md` for the broader no-PII rules that apply across the whole DS workflow.
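
One way to back the rule above with a cheap automated check is to scan the flattened params and tags before they are logged. The sketch below is illustrative only: `find_suspect_values` and `EMAIL_RE` are hypothetical names, and the pattern catches email-like strings only, so it is a tripwire and not a complete PII detector.

```python
import re

# Deliberately simple email-like pattern; an assumption for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_suspect_values(flat_params: dict[str, object]) -> list[str]:
    """Return the keys whose key or value text contains an email-like string."""
    suspects = []
    for key, value in flat_params.items():
        if EMAIL_RE.search(str(key)) or EMAIL_RE.search(str(value)):
            suspects.append(key)
    return suspects


params = {
    "data.path": "data/processed/churn_2026_03.parquet",
    "debug.sample": "data/users/alice@example.com/history.parquet",
}
suspects = find_suspect_values(params)
if suspects:
    print(f"refusing to log; suspect keys: {suspects}")
```

Calling a check like this just before the params are logged, and failing the run when it finds something, keeps a user-level path from reaching the tracking store in the first place.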


**Never commit the tracking store.** `mlflow.db`, `mlflow.db-journal`, `mlartifacts/`, and `mlruns/` all belong in `.gitignore`. They are large (easily hundreds of MB once you log a few model checkpoints), machine-local (paths and timestamps are specific to your laptop), and not reproducible from git. If you need a teammate to see a run, export the specific artifacts you want to share via `mlflow artifacts download --run-id <id>` or by running a shared tracking server — do not push the whole store into the repo and do not email `mlflow.db` as an attachment.