@zigrivers/scaffold 3.21.0 → 3.23.0

Files changed (124)
  1. package/README.md +21 -7
  2. package/content/knowledge/data-science/README.md +23 -0
  3. package/content/knowledge/data-science/data-science-architecture.md +163 -0
  4. package/content/knowledge/data-science/data-science-conventions.md +233 -0
  5. package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
  6. package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
  7. package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
  8. package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
  9. package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
  10. package/content/knowledge/data-science/data-science-observability.md +161 -0
  11. package/content/knowledge/data-science/data-science-project-structure.md +178 -0
  12. package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
  13. package/content/knowledge/data-science/data-science-requirements.md +151 -0
  14. package/content/knowledge/data-science/data-science-security.md +151 -0
  15. package/content/knowledge/data-science/data-science-testing.md +183 -0
  16. package/content/knowledge/ml/README.md +10 -0
  17. package/content/methodology/data-science-overlay.yml +39 -0
  18. package/dist/cli/commands/dashboard.d.ts.map +1 -1
  19. package/dist/cli/commands/dashboard.js +40 -0
  20. package/dist/cli/commands/dashboard.js.map +1 -1
  21. package/dist/config/schema.d.ts +672 -126
  22. package/dist/config/schema.d.ts.map +1 -1
  23. package/dist/config/schema.js +8 -0
  24. package/dist/config/schema.js.map +1 -1
  25. package/dist/config/schema.test.js +2 -2
  26. package/dist/config/schema.test.js.map +1 -1
  27. package/dist/config/validators/data-science.d.ts +4 -0
  28. package/dist/config/validators/data-science.d.ts.map +1 -0
  29. package/dist/config/validators/data-science.js +15 -0
  30. package/dist/config/validators/data-science.js.map +1 -0
  31. package/dist/config/validators/index.d.ts.map +1 -1
  32. package/dist/config/validators/index.js +2 -0
  33. package/dist/config/validators/index.js.map +1 -1
  34. package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
  35. package/dist/core/assembly/knowledge-loader.js +6 -0
  36. package/dist/core/assembly/knowledge-loader.js.map +1 -1
  37. package/dist/core/assembly/knowledge-loader.test.js +34 -0
  38. package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
  39. package/dist/dashboard/dependency-graph.d.ts +19 -0
  40. package/dist/dashboard/dependency-graph.d.ts.map +1 -0
  41. package/dist/dashboard/dependency-graph.js +180 -0
  42. package/dist/dashboard/dependency-graph.js.map +1 -0
  43. package/dist/dashboard/dependency-graph.test.d.ts +2 -0
  44. package/dist/dashboard/dependency-graph.test.d.ts.map +1 -0
  45. package/dist/dashboard/dependency-graph.test.js +409 -0
  46. package/dist/dashboard/dependency-graph.test.js.map +1 -0
  47. package/dist/dashboard/generator.d.ts +46 -0
  48. package/dist/dashboard/generator.d.ts.map +1 -1
  49. package/dist/dashboard/generator.js +1 -0
  50. package/dist/dashboard/generator.js.map +1 -1
  51. package/dist/dashboard/multi-service.test.js +257 -1
  52. package/dist/dashboard/multi-service.test.js.map +1 -1
  53. package/dist/dashboard/template.d.ts +13 -0
  54. package/dist/dashboard/template.d.ts.map +1 -1
  55. package/dist/dashboard/template.js +176 -0
  56. package/dist/dashboard/template.js.map +1 -1
  57. package/dist/e2e/dashboard-cross-service-graph-wiring.test.d.ts +2 -0
  58. package/dist/e2e/dashboard-cross-service-graph-wiring.test.d.ts.map +1 -0
  59. package/dist/e2e/dashboard-cross-service-graph-wiring.test.js +130 -0
  60. package/dist/e2e/dashboard-cross-service-graph-wiring.test.js.map +1 -0
  61. package/dist/e2e/dashboard-cross-service-graph.test.d.ts +2 -0
  62. package/dist/e2e/dashboard-cross-service-graph.test.d.ts.map +1 -0
  63. package/dist/e2e/dashboard-cross-service-graph.test.js +216 -0
  64. package/dist/e2e/dashboard-cross-service-graph.test.js.map +1 -0
  65. package/dist/e2e/project-type-overlays.test.js +73 -0
  66. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  67. package/dist/project/adopt.d.ts.map +1 -1
  68. package/dist/project/adopt.js +3 -1
  69. package/dist/project/adopt.js.map +1 -1
  70. package/dist/project/detectors/coverage.test.d.ts +2 -0
  71. package/dist/project/detectors/coverage.test.d.ts.map +1 -0
  72. package/dist/project/detectors/coverage.test.js +78 -0
  73. package/dist/project/detectors/coverage.test.js.map +1 -0
  74. package/dist/project/detectors/data-science.d.ts +4 -0
  75. package/dist/project/detectors/data-science.d.ts.map +1 -0
  76. package/dist/project/detectors/data-science.js +32 -0
  77. package/dist/project/detectors/data-science.js.map +1 -0
  78. package/dist/project/detectors/data-science.test.d.ts +2 -0
  79. package/dist/project/detectors/data-science.test.d.ts.map +1 -0
  80. package/dist/project/detectors/data-science.test.js +62 -0
  81. package/dist/project/detectors/data-science.test.js.map +1 -0
  82. package/dist/project/detectors/disambiguate.d.ts +2 -0
  83. package/dist/project/detectors/disambiguate.d.ts.map +1 -1
  84. package/dist/project/detectors/disambiguate.js +3 -2
  85. package/dist/project/detectors/disambiguate.js.map +1 -1
  86. package/dist/project/detectors/disambiguate.test.js +10 -1
  87. package/dist/project/detectors/disambiguate.test.js.map +1 -1
  88. package/dist/project/detectors/index.d.ts.map +1 -1
  89. package/dist/project/detectors/index.js +2 -0
  90. package/dist/project/detectors/index.js.map +1 -1
  91. package/dist/project/detectors/library.d.ts.map +1 -1
  92. package/dist/project/detectors/library.js +1 -0
  93. package/dist/project/detectors/library.js.map +1 -1
  94. package/dist/project/detectors/resolve-detection.test.js +31 -0
  95. package/dist/project/detectors/resolve-detection.test.js.map +1 -1
  96. package/dist/project/detectors/types.d.ts +6 -2
  97. package/dist/project/detectors/types.d.ts.map +1 -1
  98. package/dist/project/detectors/types.js.map +1 -1
  99. package/dist/types/config.d.ts +8 -1
  100. package/dist/types/config.d.ts.map +1 -1
  101. package/dist/wizard/copy/core.d.ts.map +1 -1
  102. package/dist/wizard/copy/core.js +4 -0
  103. package/dist/wizard/copy/core.js.map +1 -1
  104. package/dist/wizard/copy/data-science.d.ts +3 -0
  105. package/dist/wizard/copy/data-science.d.ts.map +1 -0
  106. package/dist/wizard/copy/data-science.js +15 -0
  107. package/dist/wizard/copy/data-science.js.map +1 -0
  108. package/dist/wizard/copy/index.d.ts.map +1 -1
  109. package/dist/wizard/copy/index.js +2 -0
  110. package/dist/wizard/copy/index.js.map +1 -1
  111. package/dist/wizard/copy/types.d.ts +5 -1
  112. package/dist/wizard/copy/types.d.ts.map +1 -1
  113. package/dist/wizard/copy/types.test-d.js +7 -0
  114. package/dist/wizard/copy/types.test-d.js.map +1 -1
  115. package/dist/wizard/questions.d.ts +2 -1
  116. package/dist/wizard/questions.d.ts.map +1 -1
  117. package/dist/wizard/questions.js +9 -1
  118. package/dist/wizard/questions.js.map +1 -1
  119. package/dist/wizard/questions.test.js +14 -0
  120. package/dist/wizard/questions.test.js.map +1 -1
  121. package/dist/wizard/wizard.d.ts.map +1 -1
  122. package/dist/wizard/wizard.js +1 -0
  123. package/dist/wizard/wizard.js.map +1 -1
  124. package/package.json +1 -1
@@ -0,0 +1,198 @@
1
+ ---
2
+ name: data-science-data-versioning
3
+ description: When and how to version data for reproducibility — size-based rule for choosing between git+Parquet, git-lfs, and DVC
4
+ topics: [data-science, data-versioning, dvc, parquet, git-lfs]
5
+ ---
6
+
7
+ If you can't answer "what data produced this result", you can't reproduce it.
8
+ A model trained on 2026-02-14's snapshot will drift from one trained today,
9
+ and without a versioning story you have no way to roll back, diff, or explain
10
+ the difference to a reviewer six months later.
11
+
12
+ Data versioning answers that question without blowing up the repo — the trick
13
+ is picking a tool proportional to the dataset size. The common failure mode
14
+ is over-engineering (wiring up DVC with a remote for a 40 MB CSV) or
15
+ under-engineering (committing 2 GB of Parquet directly into git and
16
+ discovering three months later that every clone takes twenty minutes).
17
+
18
+ ## Summary
19
+
20
+ Pick your tool by size.
21
+
22
+ - Under ~1 GB of text or Parquet, plain git with committed Parquet files is fine.
23
+ - Between 1 and 10 GB, use `git-lfs` if you already use it on the team;
24
+ otherwise invest in `DVC` (Data Version Control).
25
+ - Above 10 GB — or for any binary artifact (model weights, image corpora,
26
+ audio) — always use DVC with a remote (s3, gcs, azure).
27
+ - Never version raw third-party data that you can't legally redistribute —
28
+ store a fetch script and a content hash instead.
29
+
30
+ ## Deep Guidance
31
+
32
+ ### Size-based decision rule
33
+
34
+ | Dataset | Tool | Why |
35
+ |---------|------|-----|
36
+ | <1 GB text / Parquet | git + Parquet | Columnar compression keeps files small; git history stays sane |
37
+ | 1–10 GB (judgment call) | `git-lfs` if already adopted; DVC if you have the habit | LFS is lower-effort; DVC gives you pipeline stages too |
38
+ | >10 GB or binary artifacts | DVC with remote | Git history will not tolerate binary churn at this scale |
39
+ | Raw third-party data | Don't version — script + hash | Redistribution is often prohibited; raw bytes bloat history |
40
+
41
+ The sizes above are rules of thumb, not hard thresholds. What actually
42
+ matters is how often the data changes. A 5 GB file that you generate once
43
+ and never touch again can live in `git-lfs` forever without pain. The same
44
+ 5 GB file regenerated weekly will accumulate 260 GB of LFS storage in a
45
+ year — that's the point where DVC's content-addressed cache starts to earn
46
+ its complexity.
47
+
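The table and the weekly-regeneration arithmetic above can be sketched as two small functions. This is only an illustration of the rule of thumb: the function names, the return labels, and the `team_uses_lfs` flag are hypothetical, and the thresholds are the soft ones from the table, not hard limits.

```python
def recommend_tool(
    size_gb: float,
    is_binary: bool = False,
    redistributable: bool = True,
    team_uses_lfs: bool = False,
) -> str:
    """Map the rule-of-thumb table to a tool label (hypothetical labels)."""
    if not redistributable:
        # Raw third-party data: keep a fetch script and a hash, not the bytes.
        return "fetch script + content hash"
    if is_binary or size_gb > 10:
        return "dvc + remote"
    if size_gb >= 1:
        # The 1-10 GB band is a judgment call driven by existing habits.
        return "git-lfs" if team_uses_lfs else "dvc"
    return "git + parquet"


def lfs_storage_after(size_gb: float, regenerations: int) -> float:
    """Storage used if every regenerated copy is kept as a new LFS object."""
    return size_gb * regenerations


print(recommend_tool(0.04))       # small CSV/Parquet
print(lfs_storage_after(5, 52))   # 5 GB file regenerated weekly for a year
```

The second call reproduces the 260 GB figure: 5 GB times 52 weekly regenerations, assuming no object is ever pruned.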
48
+ A second factor is team shape. A solo researcher on a laptop rarely needs a
49
+ remote backing store; a two-person team on different continents almost
50
+ always does. Choose the tool that fits the smallest real collaboration
51
+ pattern you have, not the one that scales to the team you imagine having.
52
+
53
+ ### When git + Parquet is enough
54
+
55
+ For a solo or small-team project with modest data, commit processed Parquet directly. Keep raw data out of the repo; reserve git for cleaned, analysis-ready files.
56
+
57
+ ```python
+ # src/pipelines/clean.py
+ import pandas as pd
+
+ df = pd.read_csv("data/raw/events.csv")  # data/raw/ is gitignored
+ clean = (
+     df.dropna(subset=["user_id"])
+     # parse timestamps on the already-filtered frame, not the original df
+     .assign(ts=lambda d: pd.to_datetime(d["ts"]))
+ )
+ clean.to_parquet("data/interim/events_clean.parquet", compression="zstd")
+ ```
65
+
66
+ ```gitignore
+ # .gitignore
+ data/raw/
+ data/external/
+ *.csv
+ # do commit processed Parquet (gitignore comments must sit on their own line)
+ !data/interim/*.parquet
+ ```
73
+
74
+ Parquet's columnar layout and zstd compression typically shrink tabular data
75
+ 5–10x versus CSV. Diffs aren't line-level but file-level content hashes are
76
+ stable, which is enough for "which version produced this model".
77
+
78
+ Pair the committed Parquet with a short data card — a markdown file in
79
+ `data/interim/events_clean.md` describing row count, schema, source, and the
80
+ commit that generated it — so readers of the repo a year later can tell what
81
+ they're looking at.
82
+
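A data card like the one described above could be generated by the same script that writes the Parquet file. The sketch below is one possible shape, not a prescribed format: `render_data_card` and its fields are hypothetical, and the values passed in are placeholders.

```python
def render_data_card(path, row_count, schema, source, commit):
    """Return a markdown data card for one committed Parquet file."""
    lines = [
        f"# Data card: {path}",
        "",
        f"- Rows: {row_count:,}",
        f"- Source: {source}",
        f"- Generated by commit: {commit}",
        "",
        "| Column | Type |",
        "|--------|------|",
    ]
    # One table row per column, in schema order.
    lines += [f"| {name} | {dtype} |" for name, dtype in schema.items()]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
card = render_data_card(
    path="data/interim/events_clean.parquet",
    row_count=120_000,
    schema={"user_id": "int64", "ts": "datetime64[ns]"},
    source="data/raw/events.csv",
    commit="<git sha>",
)
print(card)
```

Writing `card` next to the Parquet file in the same commit keeps the description and the data from drifting apart.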
83
+ ### DVC basics
84
+
85
+ DVC treats large files as pointers tracked in git. The real bytes live on a remote (s3/gcs/azure/ssh), and a small `.dvc` metadata file is committed.
86
+
87
+ ```yaml
+ # dvc.yaml: pipeline stages with content-hashed inputs and outputs
+ stages:
+   ingest:
+     cmd: python src/ingest.py --out data/raw/events.parquet
+     outs:
+       - data/raw/events.parquet
+   process:
+     cmd: python src/process.py --in data/raw/events.parquet --out data/processed/features.parquet
+     deps:
+       - src/process.py
+       - data/raw/events.parquet
+     outs:
+       - data/processed/features.parquet
+ ```
102
+
103
+ Typical flow:
104
+
105
+ ```bash
+ dvc init                                  # creates .dvc/ directory
+ dvc remote add -d storage s3://my-bucket/dvc-store
+ dvc add data/raw/big_dataset.csv          # creates data/raw/big_dataset.csv.dvc and data/raw/.gitignore
+ dvc repro                                 # runs stages whose inputs changed
+ dvc push                                  # upload tracked files to remote
+ git add .dvc/config dvc.yaml dvc.lock data/raw/big_dataset.csv.dvc data/raw/.gitignore
+ git commit -m "track raw events via DVC"
+ ```
114
+
115
+ `dvc.lock` records the content hash of every stage input and output, so
116
+ `dvc repro` on a peer's machine rebuilds exactly what you rebuilt. The
117
+ `.dvc/` directory holds local cache and config; the actual bytes never touch
118
+ git.
119
+
120
+ Mental model: git for code, DVC for data, both pointing at the same commit.
121
+ When you check out an older branch, git restores the source and the `.dvc`
122
+ pointers, and `dvc checkout` restores the matching data from the local cache
123
+ into your working tree (run `dvc pull` first if the cache does not have it).
124
+
125
+ A common starting point: track one or two heavy inputs with `dvc add` (no
126
+ pipeline), and only adopt `dvc.yaml` stages once you have a repeatable
127
+ multi-step workflow. The overhead of stages pays off when you have 3+ steps
128
+ and want `dvc repro` to skip unchanged work; below that, plain `dvc add`
129
+ plus a Makefile is often clearer.
130
+
131
+ ### git-lfs middle ground
132
+
133
+ If you're already using Git LFS on the team but not ready to adopt DVC, it works acceptably for the 1–10 GB band — especially for a handful of files over the 100 MB GitHub push limit.
134
+
135
+ ```gitattributes
136
+ # .gitattributes
137
+ *.parquet filter=lfs diff=lfs merge=lfs -text
138
+ *.pkl filter=lfs diff=lfs merge=lfs -text
139
+ data/models/** filter=lfs diff=lfs merge=lfs -text
140
+ ```
141
+
142
+ ```bash
143
+ git lfs install
144
+ git lfs track "*.parquet"
145
+ git add .gitattributes data/features.parquet
146
+ git commit -m "add feature table via LFS"
147
+ ```
148
+
149
+ Reach for `git-lfs` when files are over ~100 MB but you don't need DVC's
150
+ pipeline stages or content-addressed reproducibility. Skip it if you already
151
+ have DVC set up — two tools versioning the same data is a recipe for
152
+ confusion.
153
+
154
+ LFS has real drawbacks to know about: bandwidth is metered on hosted plans,
155
+ `git clone` downloads the LFS objects for the checked-out commit by default (use `GIT_LFS_SKIP_SMUDGE=1`
156
+ to defer), and you can't selectively prune history without rewriting the
157
+ whole repo. For a working group of 2–5 people on a research project these
158
+ are usually tolerable; for a fleet of CI workers cloning on every build
159
+ they are not.
160
+
161
+ ### What not to version
162
+
163
+ - **Third-party data with license constraints**: commit a fetch script (`scripts/fetch_kaggle.sh`) and record the SHA256 of the pulled file in a README. Re-download on each environment.
164
+ - **Regenerable intermediates**: if `dvc repro` or `make data` can recreate it deterministically from upstream inputs, don't commit the bytes.
165
+ - **Scratch / exploratory outputs**: `notebooks/scratch/`, `data/tmp/`, and `.ipynb_checkpoints/` belong in `.gitignore`.
166
+ - **Anti-pattern: committing 500 MB Parquet files directly to git.** They live forever in history, clone times balloon, and nobody will clean it up later. Move to DVC or LFS *before* the first large commit, not after. Rewriting history to extract large blobs (`git filter-repo`, BFG) is disruptive to every collaborator and should be a last resort.
167
+ - **Anti-pattern: versioning model checkpoints in git.** A single PyTorch checkpoint can be several hundred MB, and training runs produce dozens. Push them to DVC or an artifact store (MLflow, Weights & Biases) keyed by experiment run ID.
168
+
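For the fetch-script-plus-hash approach in the first bullet, the check after download might look like the following. It is a minimal sketch using only the standard library; `sha256_of` and `verify_download` are hypothetical helper names, and the expected digest would come from whatever README or config file records it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large downloads do not fill memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise if the fetched file does not match the recorded digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"{path}: expected sha256 {expected_sha256}, got {actual}"
        )
```

Calling `verify_download` at the end of the fetch script turns a silent upstream change into a loud failure on the next environment setup.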
169
+ ### Quick migration path
170
+
171
+ If you're staring at a repo that has already committed large files to plain
172
+ git, the order of operations is:
173
+
174
+ 1. Decide the target tool (DVC for most cases where you got here).
175
+ 2. Run `git rm --cached <file>` to stop tracking the file in git, then run
176
+ `dvc add <file>` in its current location to create a `.dvc` pointer.
177
+ 3. Commit the pointer and the updated `.gitignore`.
178
+ 4. Optionally run `git filter-repo` to purge the old blobs from history if
179
+ clone size has become painful.
180
+
181
+ Step 4 requires coordination — everyone must re-clone — so defer it until
182
+ the pain justifies the disruption.
183
+
184
+ ### Reproducibility in practice
185
+
186
+ The goal of all of this is a single concrete question: given a git commit,
187
+ can a teammate rebuild the exact model artifact that the commit describes?
188
+ Answer yes by pinning three things together:
189
+
190
+ - **Code** — the git commit itself.
191
+ - **Data** — a `.dvc` pointer, an LFS object, or a committed Parquet file,
192
+ all content-hashed.
193
+ - **Environment** — a pinned `requirements.txt`, `pyproject.toml`, or
194
+ `conda-lock.yml` committed in the same commit.
195
+
196
+ If any one of those three is missing, reproducibility is accidental. The
197
+ versioning tool you pick is less important than treating the three as a
198
+ single atomic unit — changed together, reviewed together, reverted together.
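
One way to make the "three things pinned together" idea concrete is a small manifest written alongside each model artifact. The sketch below assumes the data is represented by small files in the repo (a `.dvc` pointer, an LFS pointer, or a modest Parquet file) and the environment by a lockfile; `build_manifest` and `is_complete` are hypothetical names, and the commit SHA is passed in by the caller.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a small file such as a .dvc pointer or a lockfile."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(commit, data_files, env_lockfile):
    """Record code, data, and environment identifiers in one structure."""
    return {
        "code": {"git_commit": commit},
        "data": {str(p): file_digest(p) for p in data_files},
        "environment": {str(env_lockfile): file_digest(env_lockfile)},
    }


def is_complete(manifest):
    """True only when all three legs of the reproducibility triple are set."""
    return (
        bool(manifest["code"].get("git_commit"))
        and bool(manifest["data"])
        and bool(manifest["environment"])
    )
```

A manifest that fails `is_complete` is a hint that reproducibility for that artifact is accidental.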
@@ -0,0 +1,159 @@
1
+ ---
2
+ name: data-science-dev-environment
3
+ description: Reproducible local Python dev environment for data science using uv, direnv, pre-commit, and pyproject.toml
4
+ topics: [data-science, dev-environment, uv, direnv, pre-commit]
5
+ ---
6
+
7
+ A data-science project that cannot be recreated in minutes is a liability. Notebooks pick up stale package versions, secrets leak into `.bashrc`, and "works on my machine" kills any chance of a collaborator (or future-you) rerunning an experiment. The fix is not complicated: one lockfile, one place for env vars, one pre-commit hook, no bespoke shell scripts. This guide is opinionated toward solo and small-team workflows where local-first beats container-first.
8
+
9
+ ## Summary
10
+
11
+ Use `uv` as the single Python package manager: it replaces `pip`, `pip-tools`, `venv`, and `virtualenv` with one fast, reproducible tool. Declare every dependency in `pyproject.toml` so there is exactly one source of truth, and commit `uv.lock` so `uv sync` gives any collaborator the same pinned package versions. Layer `direnv` on top for per-repo environment variables (tracking URIs, data paths, secrets pulled from a vault) so nothing leaks into your global shell. Add `pre-commit` with a small set of fast hooks (ruff lint and format, end-of-file fixer) so style and obvious bugs never enter a commit. Skip Docker for greenfield solo DS work; reach for it only when you cross an OS boundary (Mac dev, Linux prod) or depend on GPU/CUDA libraries.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### uv for Python environment and dependencies
16
+
17
+ `uv` is the 2025 default for Python packaging. It is a drop-in replacement for `pip`, `venv`, and `pip-tools`, written in Rust, and roughly 10-100x faster than the tools it replaces. For data science the combination of `uv sync` (reproduces the environment from the lockfile) and `uv run` (executes a script in the managed venv without activation) is the whole workflow.
18
+
19
+ Bootstrap a new project:
20
+
21
+ ```bash
+ uv init --python 3.12 myproject
+ cd myproject
+ uv add pandas numpy scikit-learn jupyterlab pandera
+ uv add --dev ruff pytest pre-commit
+ uv sync            # creates .venv and installs everything
+ uv run pytest      # runs in the managed venv, no activation needed
+ uv run jupyter lab
+ ```
30
+
31
+ A minimal `pyproject.toml`:
32
+
33
+ ```toml
34
+ [project]
35
+ name = "myproject"
36
+ version = "0.1.0"
37
+ description = "Customer churn analysis"
38
+ requires-python = ">=3.12"
39
+ dependencies = [
40
+ "pandas>=2.2",
41
+ "numpy>=2.0",
42
+ "scikit-learn>=1.5",
43
+ "jupyterlab>=4.2",
44
+ "pandera>=0.20",
45
+ ]
46
+
47
+ [dependency-groups]
48
+ dev = [
49
+ "ruff>=0.6",
50
+ "pytest>=8.0",
51
+ "pre-commit>=3.8",
52
+ ]
53
+
54
+ [tool.ruff]
55
+ line-length = 100
56
+ target-version = "py312"
57
+
58
+ [tool.ruff.lint]
59
+ select = ["E", "F", "I", "B", "UP", "PD"] # PD = pandas-vet
60
+ ```
61
+
62
+ Commit both `pyproject.toml` and `uv.lock`. A collaborator clones the repo and runs `uv sync` — that is the entire setup step. No `pip install -r requirements.txt`, no virtualenv activation, no version drift.
63
+
64
+ Add a package with `uv add <name>`; remove with `uv remove <name>`. Both edit `pyproject.toml` and update `uv.lock` atomically. To pin a version use `uv add "pandas==2.2.3"`. To upgrade run `uv lock --upgrade-package pandas`.
65
+
66
+ Two `uv` features worth knowing for data science specifically:
67
+
68
+ - **`uv run script.py`** executes a file in the project venv with no activation step. Wire this into a `Makefile` or `justfile` so `make train` and `make eval` Just Work for any collaborator.
69
+ - **Inline script metadata (PEP 723).** For one-off analysis scripts that live outside the project, a comment-block header at the top of the file declares dependencies and `uv run` auto-creates an ephemeral venv:
70
+
71
+ ```python
72
+ # /// script
73
+ # requires-python = ">=3.12"
74
+ # dependencies = ["pandas", "duckdb"]
75
+ # ///
76
+ import pandas as pd
77
+ import duckdb
78
+ ...
79
+ ```
80
+
81
+ Running `uv run oneoff.py` resolves and caches those deps transparently. No more "should I add this to `pyproject.toml`?" for throwaway exploration.
82
+
83
+ ### direnv for env vars
84
+
85
+ `direnv` loads a per-directory `.envrc` file whenever you `cd` into the project. It keeps secrets and tracking URIs out of your global shell and ensures every terminal session sees the same variables. Skip it if your project has no environment variables; add it the first time you reach for one.
86
+
87
+ Install once (`brew install direnv`, then hook into your shell per the docs). In the project:
88
+
89
+ ```bash
+ # .envrc (commit this file)
+ export VIRTUAL_ENV="$PWD/.venv"   # point tools at the uv-managed venv
+ PATH_add .venv/bin                # put the venv's python first on PATH
+ source_up_if_exists               # inherit vars from parent .envrc if present
+
+ # Experiment tracking
+ export MLFLOW_TRACKING_URI="http://localhost:5000"
+
+ # Data paths (relative to repo root)
+ export DATA_DIR="$PWD/data"
+ export MODELS_DIR="$PWD/models"
+
+ # Make imports work without installing the package
+ export PYTHONPATH="$PWD/src${PYTHONPATH:+:$PYTHONPATH}"
+
+ # Secrets: source from a local-only file, never commit
+ source_env_if_exists .envrc.local
+ ```
107
+
108
+ Add `.envrc.local` to `.gitignore` and put any actual secrets there (API keys, DB passwords). Run `direnv allow` once after creating or editing `.envrc`; `direnv` will refuse to load until you do. The moment you `cd` out of the project, all variables unload — no pollution.
109
+
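On the Python side, the variables exported by `.envrc` can be read in one place so that scripts still run when `direnv` is not loaded (for example in CI). This is a sketch under that assumption; the `Settings` class, `load_settings`, and the fallback defaults are hypothetical.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    tracking_uri: str
    data_dir: Path
    models_dir: Path


def load_settings(env=None):
    """Read project settings from the environment, with local defaults."""
    env = os.environ if env is None else env
    return Settings(
        tracking_uri=env.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
        data_dir=Path(env.get("DATA_DIR", "data")),
        models_dir=Path(env.get("MODELS_DIR", "models")),
    )


settings = load_settings()
print(settings.data_dir)
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real environment.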
110
+ ### pre-commit hooks
111
+
112
+ `pre-commit` runs a configured set of checks every time you `git commit`. Keep the hook list short and fast — anything slower than a second or two trains you to use `--no-verify`, which defeats the point. For data science the right starter set is format, lint, and a couple of sanity hooks.
113
+
114
+ ```yaml
+ # .pre-commit-config.yaml
+ repos:
+   - repo: https://github.com/astral-sh/ruff-pre-commit
+     rev: v0.6.9
+     hooks:
+       - id: ruff            # lint hook id at this rev; runs first so fixes get formatted
+         args: [--fix]
+       - id: ruff-format
+
+   - repo: https://github.com/pre-commit/pre-commit-hooks
+     rev: v5.0.0
+     hooks:
+       - id: end-of-file-fixer
+       - id: trailing-whitespace
+       - id: check-yaml
+       - id: check-added-large-files
+         args: [--maxkb=500]  # block accidental dataset commits
+
+   - repo: https://github.com/kynan/nbstripout
+     rev: 0.7.1
+     hooks:
+       - id: nbstripout       # strips notebook outputs before commit
+ ```
138
+
139
+ Install the git hook once per clone:
140
+
141
+ ```bash
142
+ uv run pre-commit install
143
+ uv run pre-commit run --all-files # bootstrap: fix everything now
144
+ ```
145
+
146
+ Deliberately excluded from this set: `mypy`, `pytest`, and `bandit`. They are all valuable, but they are slow enough that they belong in CI, not in the commit path. Fast local, thorough remote.
147
+
148
+ ### When to add Docker
149
+
150
+ For greenfield solo data science, Docker is overhead you do not need. `uv sync` already gives you reproducibility on any machine with the same OS, and local iteration is faster without a container layer.
151
+
152
+ Reach for Docker only when one of these is true:
153
+
154
+ - **OS mismatch between dev and prod.** You develop on macOS but the model runs on Linux in production, and a native dependency (e.g. a C extension, a specific `libgomp`) behaves differently across platforms.
155
+ - **GPU / CUDA dependencies.** CUDA toolkit versions are tightly coupled to driver versions and OS. A pinned `nvidia/cuda` base image is the only sane way to guarantee training reproducibility across machines.
156
+ - **Handoff to MLOps or serving infra.** Production deployment targets (SageMaker, Vertex, KServe, plain Kubernetes) expect a container. Build one at the handoff boundary, not before.
157
+ - **Onboarding collaborators with hostile local setups.** A Windows colleague who cannot install `uv` natively is a reasonable reason to ship a devcontainer.
158
+
159
+ When you do add Docker, keep it thin: copy `pyproject.toml` and `uv.lock`, run `uv sync --frozen`, and let the same lockfile drive both local and container builds. That way the container is a packaging detail, not a parallel source of truth.
@@ -0,0 +1,194 @@
1
+ ---
2
+ name: data-science-experiment-tracking
3
+ description: Local MLflow setup, run instrumentation, git commit tagging, and run comparison for solo and small-team data science work
4
+ topics: [data-science, experiment-tracking, mlflow, weights-and-biases, reproducibility]
5
+ ---
6
+
7
+ Without experiment tracking, data science becomes archaeology: three weeks after a promising result, a stakeholder asks "which config produced that number?" and answering it turns into a forensic exercise — sifting through notebook history, Slack messages, and commented-out cells. A lightweight experiment tracker fixes this with one discipline: every run logs its hyperparameters, metrics, artifacts, and the git commit SHA that produced it. For a solo DS or small team, you do not need a shared server or a cloud account — a local MLflow instance on SQLite is enough to get the full benefit, and you can graduate to a shared deployment later without changing the instrumentation.
8
+
9
+ ## Summary
10
+
11
+ Self-host MLflow locally with a SQLite backend and a local artifact directory — it is the minimum setup that still gives you a queryable run history, a browsable UI, and reproducible run IDs. Every run logs the full hyperparameter dict, metrics per epoch (or iteration), the git commit SHA as a tag, dataset version, and any config or report artifacts. Weights & Biases is a reasonable cloud alternative if you value the polished UI and do not mind cloud storage — but for a DS-1 setup it is not the primary recommendation. Never log PII into run metadata or artifacts, and never commit `mlflow.db` or `mlartifacts/` to git.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### What to log per run
16
+
17
+ Treat every training run, hyperparameter tweak, or evaluation pass as a tracked experiment — even the exploratory ones you think will be throwaway. The cost of logging is trivial; the cost of not logging a run that turns out to matter is measured in hours of re-running and second-guessing. The minimum payload is:
18
+
19
+ - **Hyperparameters**: the full config dict (learning rate, batch size, seed, feature set, model type, loss weights, regularization). Log it all — future-you does not know which knob will matter and adding knobs retroactively is impossible.
20
+ - **Metrics**: logged with `step=epoch` (or `step=iteration`) so the UI can render a time-series plot. Log train and validation metrics side by side; a single final-value log loses the overfitting story.
21
+ - **Git commit SHA**: a tag pointing to the exact commit that produced the run. Without this, "reproduce run 47" is unanswerable, because the config alone does not capture code changes in the training loop, data loader, or feature engineering.
22
+ - **Dataset version**: a tag or param identifying which dataset snapshot was used — a DVC hash, a filename with a date suffix, or a data commit SHA. Without this, "reproduce run 47" is still unanswerable even if you have the code, because the data moved underneath it.
23
+ - **Run name**: a human-readable name (`baseline-v3-with-dropout`) so the UI list is browsable without clicking every row to read the params.
24
+ - **Artifacts**: the resolved config YAML, the evaluation report JSON, any confusion matrix images, and the final model checkpoint. Small artifacts go inline with the run; large model weights can be stored by reference.
25
+
26
+ ### MLflow self-hosted setup
27
+
28
+ Run the tracking server locally. SQLite is the right backend for a solo workflow — it gives you the full query API without the ops burden of Postgres, and the `mlflow.db` file is small enough that you can zip and share it with a collaborator if you really need to:
29
+
30
+ ```bash
31
+ mlflow server \
32
+ --backend-store-uri sqlite:///mlflow.db \
33
+ --default-artifact-root ./mlartifacts \
34
+ --host 127.0.0.1 --port 5000
35
+ ```
36
+
37
+ Bind to `127.0.0.1` rather than `0.0.0.0` so you do not accidentally expose an unauthenticated tracking server to your network. Leave it running in a terminal tab, a `tmux` pane, or under `launchd`/`systemd` — whatever keeps it up between sessions.
38
+
39
+ Point your code at the server via an environment variable. Using `direnv` keeps this per-project and avoids polluting your shell:
40
+
41
+ ```bash
42
+ # .envrc
43
+ export MLFLOW_TRACKING_URI=http://localhost:5000
44
+ export MLFLOW_EXPERIMENT_NAME=churn-baseline
45
+ ```
46
+
47
+ Add the tracking artifacts to `.gitignore` — they are large, local, and not reproducible from source. Committing them bloats the repo and leaks local-path metadata into history:
48
+
49
+ ```gitignore
50
+ # .gitignore
51
+ mlflow.db
52
+ mlflow.db-journal
53
+ mlartifacts/
54
+ mlruns/
55
+ ```
56
+
57
+ When you later graduate to a shared MLflow server (team deployment, S3 artifact store, Postgres backend), the only change is the `MLFLOW_TRACKING_URI` — your instrumentation code stays identical, and historical runs stay on your laptop as a personal archive.
58
+
59
+ ### Instrumenting a training / experiment run
60
+
61
+ Wrap the training loop in `mlflow.start_run`. The context manager handles start and end timestamps, guarantees the run closes even on exception, and exposes `run.info.run_id` — the stable handle you use later for comparison, export, or model loading:
62
+
63
+ ```python
+ import subprocess
+ import mlflow
+ import yaml
+
+ # _flatten, train_epoch, and evaluate are project-specific helpers defined elsewhere.
+ mlflow.set_tracking_uri("http://localhost:5000")
+ mlflow.set_experiment("churn-baseline")
+
+ def train(cfg: dict) -> dict:
+     with mlflow.start_run(run_name=cfg["experiment"]["name"]) as run:
+         # Log full hyperparameter dict (flatten nested keys to dot-paths)
+         mlflow.log_params(_flatten(cfg))
+
+         # Reproducibility tags: git commit is the single most important one
+         git_sha = subprocess.check_output(
+             ["git", "rev-parse", "HEAD"]
+         ).decode().strip()
+         mlflow.set_tag("git_commit", git_sha)
+         mlflow.set_tag("dataset_version", cfg["data"]["version"])
+         mlflow.set_tag("model_type", cfg["model"]["type"])
+
+         # Per-epoch metrics: step=epoch is what gives you a time-series plot
+         for epoch in range(cfg["training"]["epochs"]):
+             train_metrics = train_epoch(...)
+             val_metrics = evaluate(...)
+             mlflow.log_metrics({
+                 "train_loss": train_metrics["loss"],
+                 "val_loss": val_metrics["loss"],
+                 "val_auc": val_metrics["auc"],
+             }, step=epoch)
+
+         # Artifacts: resolved config + eval report
+         with open("configs/resolved.yaml", "w") as f:
+             yaml.safe_dump(cfg, f)
+         mlflow.log_artifact("configs/resolved.yaml")
+         mlflow.log_artifact("reports/eval_report.json")
+
+         return {"run_id": run.info.run_id, **val_metrics}
+ ```
102
+
103
+ A few notes on the shape of this code. `mlflow.log_params` takes a flat dict, so a helper like `_flatten` turns `{"optimizer": {"lr": 1e-3}}` into `{"optimizer.lr": "0.001"}`; values are coerced to strings. Log the **resolved** config after any CLI overrides or Hydra composition, not the raw file on disk, so the stored params match what actually ran. If the working tree is dirty at training time, either commit first or log `git status --porcelain` output as a tag so you can tell the logged commit is not the whole story. Keep the returned `run_id`: it is the primary key you will use to find this run in the UI, export its metadata, register its model later, or reference it from a downstream evaluation run via `mlflow.set_tag("parent_run_id", ...)`.
104
+
105
+ ### Run comparison and selection
106
+
107
+ Open the MLflow UI at `http://localhost:5000`. The three views that earn their keep:
108
+
109
+ - **Run list** — sort by `metrics.val_auc` or filter by `tags.git_commit = "<sha>"`. Tag filters are the fastest way to find "the runs I launched from this branch." Sort by a metric column to find the `run_id` of your best-performing run, then click through for the full picture.
110
+ - **Parallel coordinates plot** — select several runs, switch to the parallel coordinates view, and see which hyperparameters correlate with your target metric. This is the view that turns dozens of runs into a readable pattern — hover a line to see the full config, drag axes to filter a band, and the plot re-paints to show only the runs that meet your criterion.
111
+ - **Metric plot** — overlay `val_loss` across selected runs to spot overfitting (train loss drops, val loss rises), bad seeds (wildly different trajectories with the same config), or early-stopping candidates (val metric plateaued ten epochs before training ended).
112
+
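The early-stopping check described for the metric plot can also be done numerically once you have a run's per-epoch values (for instance from `MlflowClient.get_metric_history(run_id, "val_loss")`, reading each point's `.value`). The sketch below works on a plain list of losses; the function names are hypothetical.

```python
def best_epoch(val_losses):
    """Index of the lowest validation loss (the first one on ties)."""
    if not val_losses:
        raise ValueError("need at least one validation loss")
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def epochs_past_best(val_losses):
    """How many epochs ran after the best validation loss was reached."""
    return len(val_losses) - 1 - best_epoch(val_losses)
```

A large `epochs_past_best` on a run whose training loss kept falling is the numeric counterpart of the "train loss drops, val loss rises" pattern in the plot.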
113
+ You can also query programmatically when the UI's filters are not expressive enough:
114
+
115
+ ```python
116
+ from mlflow.tracking import MlflowClient
117
+ import pandas as pd
118
+
119
+ client = MlflowClient()
120
+ runs = client.search_runs(
121
+     experiment_ids=[client.get_experiment_by_name("churn-baseline").experiment_id],
122
+     filter_string="metrics.val_auc > 0.82 and tags.dataset_version = '2026-03'",
123
+     order_by=["metrics.val_auc DESC"],
124
+     max_results=20,
125
+ )
126
+ df = pd.DataFrame([{
127
+     "run_id": r.info.run_id,
128
+     "name": r.info.run_name,
129
+     "val_auc": r.data.metrics.get("val_auc"),
130
+     "lr": r.data.params.get("optimizer.lr"),
131
+ } for r in runs])
132
+ ```
133
+
134
+ When you have a winner, export its config back into the repo for a clean retrain:
135
+
136
+ ```python
137
+ run = client.get_run("<run_id>")
138
+ client.download_artifacts(run.info.run_id, "resolved.yaml", dst_path="configs/")
139
+ ```
140
+
141
+ Commit the exported config so "run 47's exact recipe" becomes a file in git, not a memory and not a database row that lives only on your laptop.
142
+
143
+ ### Weights & Biases as alternative
144
+
145
+ Weights & Biases is the polished cloud alternative. It has a richer UI, built-in system metric logging (GPU, memory, temperature), gradient histograms via `wandb.watch(model, log="gradients")`, media logging (images, audio, tables, confusion matrices rendered inline), and better collaboration features — named reports, shared dashboards, thread-style comments. For a small team that has already decided it is comfortable with cloud storage, W&B removes the "who runs the server" question entirely, and the onboarding for a new teammate is `pip install wandb && wandb login`.
146
+
147
+ The instrumentation shape is familiar:
148
+
149
+ ```python
150
+ import wandb
151
+ wandb.init(project="churn-baseline", name=cfg["experiment"]["name"],
152
+            config=cfg, tags=["baseline", "v2-features"])
153
+ for epoch in range(cfg["training"]["epochs"]):
154
+     wandb.log({"epoch": epoch, "val/auc": val_auc, "train/loss": loss})
155
+ wandb.finish()
156
+ ```
157
+
158
+ Two things to weigh before picking it over MLflow for a DS-1 setup. First, the free tier limits private projects, artifact retention, and seats: fine for a solo experimenter, worth pricing out before a team commits. Second, you are shipping your experiment metadata (and potentially artifacts) to a third party, which matters if your dataset values, config parameters, or run names might accidentally encode something sensitive. MLflow self-hosted stays on your laptop; W&B lives in someone else's cloud. If your org has a data-residency or vendor-review process, a local MLflow setup skips it entirely.
159
+
160
+ ### Graduating from solo to shared
161
+
162
+ The path from "SQLite on my laptop" to "shared team tracker" is short and deliberately low-risk, because your instrumentation already speaks the MLflow protocol:
163
+
164
+ 1. **Stand up a shared MLflow server** behind your internal network — Postgres backend, S3 or equivalent object store for artifacts, authentication in front (oauth2-proxy, nginx basic auth, or a cloud load balancer).
165
+ 2. **Flip the tracking URI** in each project's `.envrc` to point at the shared server. No code changes.
166
+ 3. **Optionally backfill historic runs** using `mlflow artifacts download` plus `search_runs`, then re-log to the shared server — only worth it for runs you want the team to see.
167
+ 4. **Keep the local server configured** for offline or air-gapped work — you can still set `MLFLOW_TRACKING_URI=file:./mlruns` for quick local-only iteration when the shared server is down or you are on a plane.
168
+
169
+ This is why the self-hosted local setup is the right default even if you know you will eventually run a team server: the instrumentation you write today is exactly the instrumentation that will talk to production tomorrow.
170
+
171
+ ### Nested runs for sweeps and evaluations
172
+
173
+ When you run a small hyperparameter sweep from a laptop — a few learning rates, two or three seeds — use MLflow's nested runs rather than one flat run per trial. A parent run captures the sweep-level config and best-metric summary; each trial becomes a child with its own params and metrics:
174
+
175
+ ```python
176
+ with mlflow.start_run(run_name="lr-sweep") as parent:
177
+     mlflow.log_param("sweep_type", "grid")
178
+     best_auc = 0.0
179
+     for lr in [1e-4, 3e-4, 1e-3, 3e-3]:
180
+         with mlflow.start_run(run_name=f"lr={lr}", nested=True) as child:
181
+             mlflow.log_param("lr", lr)
182
+             val_auc = train_one(lr)
183
+             mlflow.log_metric("val_auc", val_auc)
184
+             best_auc = max(best_auc, val_auc)
185
+     mlflow.log_metric("best_val_auc", best_auc)
186
+ ```
187
+
188
+ In the UI, the parent shows an expandable tree of children, which keeps the run list navigable once you have hundreds of rows. For larger sweeps, pair MLflow with Optuna (`mlflow.start_run(nested=True)` inside an `objective` function) — you get Bayesian search on top of MLflow's persistence.
189
+
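The sweep example varies only the learning rate, while the text also mentions two or three seeds. One way to enumerate the full grid before opening any child runs is sketched below. The `sweep_trials` helper and its run-name format are hypothetical; each returned dict would feed one `mlflow.start_run(..., nested=True)` block.

```python
from itertools import product


def sweep_trials(lrs, seeds):
    """Enumerate every (lr, seed) combination with a readable run name."""
    return [
        {"run_name": f"lr={lr}-seed={seed}", "lr": lr, "seed": seed}
        for lr, seed in product(lrs, seeds)
    ]
```

Building the list up front also gives you the trial count to log on the parent run, so an interrupted sweep is easy to spot later.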
190
+ ### Hygiene
191
+
192
+ **Never log PII.** Experiment metadata and artifacts are easy to share with collaborators, screenshot into a ticket, or accidentally export to a cloud UI when you later migrate to W&B or a shared MLflow server. Hyperparameter values, metric names, run names, tags, and artifact contents must all be free of customer identifiers, emails, names, or raw records. If a config carries a data path, keep it at the dataset level (`data/processed/churn_2026_03.parquet`), never at the row or user level (`data/users/alice@example.com/history.parquet`). Evaluation reports that include example predictions must redact any PII in the input columns before being logged as artifacts. See `data-science-security.md` for the broader no-PII rules that apply across the whole DS workflow.
193
+
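A cheap guard can catch the most obvious slip, an email address embedded in a param, tag, or path, before anything reaches the tracking store. The sketch below is an assumption-laden illustration: the regex only recognises email-like strings, the helper name `find_email_like` is hypothetical, and it is not a substitute for a real PII review.

```python
import re

# Deliberately simple pattern: something@domain.tld. It will miss other
# kinds of PII (names, phone numbers, account ids).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email_like(values):
    """Return the keys whose key or value contains an email-like string."""
    flagged = []
    for key, value in values.items():
        if EMAIL_RE.search(str(key)) or EMAIL_RE.search(str(value)):
            flagged.append(key)
    return flagged
```

Running this over the flattened params and tags and refusing to log when the result is non-empty turns the rule into a check that fails loudly.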
194
+ **Never commit the tracking store.** `mlflow.db`, `mlflow.db-journal`, `mlartifacts/`, and `mlruns/` all belong in `.gitignore`. They are large (easily hundreds of MB once you log a few model checkpoints), machine-local (paths and timestamps are specific to your laptop), and not reproducible from git. If you need a teammate to see a run, export the specific artifacts you want to share via `mlflow artifacts download --run-id <id>` or by running a shared tracking server — do not push the whole store into the repo and do not email `mlflow.db` as an attachment.