@zigrivers/scaffold 3.22.0 → 3.24.0

Files changed (111)
  1. package/README.md +44 -23
  2. package/content/knowledge/core/automated-review-tooling.md +3 -3
  3. package/content/knowledge/core/multi-model-review-dispatch.md +13 -4
  4. package/content/knowledge/data-science/README.md +23 -0
  5. package/content/knowledge/data-science/data-science-architecture.md +163 -0
  6. package/content/knowledge/data-science/data-science-conventions.md +233 -0
  7. package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
  8. package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
  9. package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
  10. package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
  11. package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
  12. package/content/knowledge/data-science/data-science-observability.md +161 -0
  13. package/content/knowledge/data-science/data-science-project-structure.md +178 -0
  14. package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
  15. package/content/knowledge/data-science/data-science-requirements.md +151 -0
  16. package/content/knowledge/data-science/data-science-security.md +151 -0
  17. package/content/knowledge/data-science/data-science-testing.md +183 -0
  18. package/content/knowledge/ml/README.md +10 -0
  19. package/content/methodology/data-science-overlay.yml +39 -0
  20. package/content/pipeline/build/multi-agent-resume.md +7 -6
  21. package/content/pipeline/build/multi-agent-start.md +7 -6
  22. package/content/pipeline/build/single-agent-resume.md +7 -6
  23. package/content/pipeline/build/single-agent-start.md +7 -6
  24. package/content/pipeline/environment/automated-pr-review.md +79 -27
  25. package/content/skills/mmr/SKILL.md +72 -2
  26. package/content/skills/scaffold-runner/SKILL.md +65 -19
  27. package/content/tools/review-code.md +74 -16
  28. package/content/tools/review-pr.md +25 -6
  29. package/dist/cli/commands/check.d.ts.map +1 -1
  30. package/dist/cli/commands/check.js +28 -17
  31. package/dist/cli/commands/check.js.map +1 -1
  32. package/dist/config/schema.d.ts +672 -126
  33. package/dist/config/schema.d.ts.map +1 -1
  34. package/dist/config/schema.js +8 -0
  35. package/dist/config/schema.js.map +1 -1
  36. package/dist/config/schema.test.js +2 -2
  37. package/dist/config/schema.test.js.map +1 -1
  38. package/dist/config/validators/data-science.d.ts +4 -0
  39. package/dist/config/validators/data-science.d.ts.map +1 -0
  40. package/dist/config/validators/data-science.js +15 -0
  41. package/dist/config/validators/data-science.js.map +1 -0
  42. package/dist/config/validators/index.d.ts.map +1 -1
  43. package/dist/config/validators/index.js +2 -0
  44. package/dist/config/validators/index.js.map +1 -1
  45. package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
  46. package/dist/core/assembly/knowledge-loader.js +6 -0
  47. package/dist/core/assembly/knowledge-loader.js.map +1 -1
  48. package/dist/core/assembly/knowledge-loader.test.js +34 -0
  49. package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
  50. package/dist/e2e/project-type-overlays.test.js +73 -0
  51. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  52. package/dist/project/adopt.d.ts.map +1 -1
  53. package/dist/project/adopt.js +3 -1
  54. package/dist/project/adopt.js.map +1 -1
  55. package/dist/project/detectors/coverage.test.d.ts +2 -0
  56. package/dist/project/detectors/coverage.test.d.ts.map +1 -0
  57. package/dist/project/detectors/coverage.test.js +78 -0
  58. package/dist/project/detectors/coverage.test.js.map +1 -0
  59. package/dist/project/detectors/data-science.d.ts +4 -0
  60. package/dist/project/detectors/data-science.d.ts.map +1 -0
  61. package/dist/project/detectors/data-science.js +32 -0
  62. package/dist/project/detectors/data-science.js.map +1 -0
  63. package/dist/project/detectors/data-science.test.d.ts +2 -0
  64. package/dist/project/detectors/data-science.test.d.ts.map +1 -0
  65. package/dist/project/detectors/data-science.test.js +62 -0
  66. package/dist/project/detectors/data-science.test.js.map +1 -0
  67. package/dist/project/detectors/disambiguate.d.ts +2 -0
  68. package/dist/project/detectors/disambiguate.d.ts.map +1 -1
  69. package/dist/project/detectors/disambiguate.js +3 -2
  70. package/dist/project/detectors/disambiguate.js.map +1 -1
  71. package/dist/project/detectors/disambiguate.test.js +10 -1
  72. package/dist/project/detectors/disambiguate.test.js.map +1 -1
  73. package/dist/project/detectors/index.d.ts.map +1 -1
  74. package/dist/project/detectors/index.js +2 -0
  75. package/dist/project/detectors/index.js.map +1 -1
  76. package/dist/project/detectors/library.d.ts.map +1 -1
  77. package/dist/project/detectors/library.js +1 -0
  78. package/dist/project/detectors/library.js.map +1 -1
  79. package/dist/project/detectors/resolve-detection.test.js +31 -0
  80. package/dist/project/detectors/resolve-detection.test.js.map +1 -1
  81. package/dist/project/detectors/types.d.ts +6 -2
  82. package/dist/project/detectors/types.d.ts.map +1 -1
  83. package/dist/project/detectors/types.js.map +1 -1
  84. package/dist/types/config.d.ts +8 -1
  85. package/dist/types/config.d.ts.map +1 -1
  86. package/dist/wizard/copy/core.d.ts.map +1 -1
  87. package/dist/wizard/copy/core.js +4 -0
  88. package/dist/wizard/copy/core.js.map +1 -1
  89. package/dist/wizard/copy/data-science.d.ts +3 -0
  90. package/dist/wizard/copy/data-science.d.ts.map +1 -0
  91. package/dist/wizard/copy/data-science.js +15 -0
  92. package/dist/wizard/copy/data-science.js.map +1 -0
  93. package/dist/wizard/copy/index.d.ts.map +1 -1
  94. package/dist/wizard/copy/index.js +2 -0
  95. package/dist/wizard/copy/index.js.map +1 -1
  96. package/dist/wizard/copy/types.d.ts +5 -1
  97. package/dist/wizard/copy/types.d.ts.map +1 -1
  98. package/dist/wizard/copy/types.test-d.js +7 -0
  99. package/dist/wizard/copy/types.test-d.js.map +1 -1
  100. package/dist/wizard/questions.d.ts +2 -1
  101. package/dist/wizard/questions.d.ts.map +1 -1
  102. package/dist/wizard/questions.js +9 -1
  103. package/dist/wizard/questions.js.map +1 -1
  104. package/dist/wizard/questions.test.js +14 -0
  105. package/dist/wizard/questions.test.js.map +1 -1
  106. package/dist/wizard/wizard.d.ts.map +1 -1
  107. package/dist/wizard/wizard.js +1 -0
  108. package/dist/wizard/wizard.js.map +1 -1
  109. package/package.json +1 -1
  110. package/skills/mmr/SKILL.md +72 -2
  111. package/skills/scaffold-runner/SKILL.md +65 -19
@@ -0,0 +1,233 @@
1
+ ---
2
+ name: data-science-conventions
3
+ description: Python coding conventions for solo data-science work — ruff for lint+format, pragmatic type hints, pyproject.toml as single config source, import ordering, module layout, naming, and docstrings
4
+ topics: [data-science, conventions, python, ruff, type-hints]
5
+ ---
6
+
7
+ Solo data-science code drifts faster than any other kind of Python: half of it lives in notebooks, the other half migrates into scripts, and nothing stays stable long enough to earn a style review. Consistent conventions are the only thing that keeps cognitive load bounded when you come back to a project after two months. Encode them in tooling (`ruff`, `pyproject.toml`) so they run on save — not on willpower — and the notebook→script promotion path stays smooth instead of becoming a cleanup tax.
8
+
9
+ ## Summary
10
+
11
+ Use `ruff` as the single lint + format tool: `ruff format` is Black-compatible and replaces Black, so do not install both. Apply type hints pragmatically: annotate any function another module imports, and omit annotations on throwaway notebook helpers. Centralize all project and tool configuration in `pyproject.toml`, one file for build metadata, dependencies, ruff, and pytest. Use `ruff`/`isort`-style import sections (stdlib → third-party → local), a `src/` layout (not a flat one) with a clear module split, and docstrings sized to the consumer: one-liners for internal helpers, full Google or NumPy style for anything a teammate will call without reading the source.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Linter + formatter (ruff)
16
+
17
+ `ruff` is the only Python linter/formatter a solo DS project needs. It replaces `flake8`, `isort`, `pyupgrade`, `pydocstyle`, `pylint` (mostly), and — via `ruff format` — Black. It is an order of magnitude faster than the tools it replaces, configured in one `[tool.ruff]` block, and has no plugin-management overhead. Do not layer Black on top: `ruff format` implements the same formatting contract, and running both just causes churn.
18
+
19
+ ```toml
20
+ # pyproject.toml
21
+ [tool.ruff]
22
+ line-length = 100
23
+ target-version = "py311"
24
+ extend-exclude = ["notebooks/_scratch", "data", "models"]
25
+
26
+ [tool.ruff.lint]
27
+ select = [
28
+ "E", # pycodestyle errors
29
+ "W", # pycodestyle warnings
30
+ "F", # pyflakes
31
+ "I", # isort (import sorting)
32
+ "N", # pep8-naming
33
+ "UP", # pyupgrade
34
+ "B", # flake8-bugbear
35
+ "C90", # mccabe complexity
36
+ "D", # pydocstyle
37
+ ]
38
+ ignore = [
39
+ "D100", # missing docstring in public module — noisy for scripts
40
+ "D104", # missing docstring in public package
41
+ "E501", # line-too-long — formatter handles it
42
+ ]
43
+
44
+ [tool.ruff.lint.per-file-ignores]
45
+ # Notebooks and experiment scripts get a lighter hand
46
+ "notebooks/**/*.py" = ["D", "N806", "E402"]
47
+ "scripts/**/*.py" = ["D"]
48
+ "tests/**/*.py" = ["D"]
49
+
50
+ [tool.ruff.lint.pydocstyle]
51
+ convention = "google"
52
+
53
+ [tool.ruff.lint.mccabe]
54
+ max-complexity = 12
55
+
56
+ [tool.ruff.format]
57
+ quote-style = "double"
58
+ indent-style = "space"
59
+ ```
60
+
61
+ **Tradeoff**: notebook and exploration code legitimately breaks rules that production code should not — uppercase variable names (`X_train`), imports after executable code, no docstrings. The `per-file-ignores` block disables the rules that fight notebook workflows without weakening the defaults for `src/`. Do not globally ignore `D` or `N` just to silence notebook noise.
62
+
63
+ Run on save (editor integration) and as a pre-commit hook. In CI, run `ruff check .` and `ruff format --check .` — the `--check` flag fails instead of rewriting.
64
+
65
+ ### Type hints
66
+
67
+ Python is dynamically typed, and forcing full annotations onto exploratory code wastes time. The rule is **import boundary = type boundary**: if another module imports the function, type it. Notebook-local helpers and inline lambdas do not need annotations.
68
+
69
+ ```python
70
+ # src/features/encoders.py — imported by training and serving, fully typed
71
+ from __future__ import annotations
72
+
73
+ import numpy as np
74
+ import pandas as pd
75
+
76
+
77
+ def target_encode(
78
+     series: pd.Series,
79
+     target: pd.Series,
80
+     smoothing: float = 10.0,
81
+ ) -> pd.Series:
82
+     """Smoothed target encoding for a categorical feature.
83
+
84
+     Args:
85
+         series: Categorical feature values (any hashable dtype).
86
+         target: Numeric target aligned to `series` by index.
87
+         smoothing: Prior weight; higher values pull rare categories
88
+             toward the global mean.
89
+
90
+     Returns:
91
+         Series of encoded floats aligned to `series.index`.
92
+     """
93
+     global_mean = target.mean()
94
+     agg = target.groupby(series).agg(["mean", "count"])
95
+     weight = agg["count"] / (agg["count"] + smoothing)
96
+     encoding = weight * agg["mean"] + (1 - weight) * global_mean
97
+     return series.map(encoding).astype(np.float64)
98
+ ```
99
+
100
+ ```python
101
+ # notebooks/03_eda.py — throwaway scratch, no annotations needed
102
+ def quick_hist(col):
103
+     return df[col].value_counts().head(20)
104
+
105
+ for c in cat_cols:
106
+     print(c, quick_hist(c).to_dict())
107
+ ```
108
+
109
+ Practical rules:
110
+ - Type every function exported from `src/` — parameters and return.
111
+ - Type dataclasses and `TypedDict` schemas that describe data contracts (row shapes, config dicts).
112
+ - Skip annotations on notebook cells, inline closures, and private helpers inside a single script.
113
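+ - Use `from __future__ import annotations` at the top of every `src/` file. It stores annotations as lazy strings, so forward references just work and annotations are never evaluated at import time. To avoid paying for expensive-to-import types (`torch.Tensor`, `pd.DataFrame`) that are used only in annotations, also move their imports under `if TYPE_CHECKING:`.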
114
+ - Do not run `mypy --strict` on a solo DS project. Run it on `src/` with `--ignore-missing-imports` if you want a safety net, and do not bother with notebooks.
115
+
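The practical rules above can be sketched in one small module. This is an illustrative sketch, not code from the package: `EventRow`, `SplitConfig`, and `total_amount` are hypothetical names, and `Decimal` stands in for any type that is expensive to import and used only in annotations.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING, TypedDict

if TYPE_CHECKING:
    # Seen only by type checkers; this import never runs at runtime.
    from decimal import Decimal


class EventRow(TypedDict):
    """Row shape for one cleaned event (hypothetical data contract)."""

    user_id: str
    ts: str
    amount: float


@dataclass(frozen=True)
class SplitConfig:
    """Typed config contract for a train/test split (hypothetical)."""

    test_size: float = 0.2
    seed: int = 42


def total_amount(rows: list[EventRow], scale: Decimal | None = None) -> float:
    """Sum `amount` across rows, optionally multiplied by `scale`."""
    factor = float(scale) if scale is not None else 1.0
    return factor * sum(row["amount"] for row in rows)
```

Because annotations are lazy, `Decimal | None` is never evaluated when the module loads, so the guarded import costs nothing for callers that never pass `scale`.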
116
+ ### Project layout and pyproject.toml
117
+
118
+ One `pyproject.toml` at the repo root configures the build, dependencies, lint, format, and tests. Do not scatter config across `setup.cfg`, `.flake8`, `.isort.cfg`, and `pytest.ini` — everything lives in `pyproject.toml`.
119
+
120
+ ```toml
121
+ # pyproject.toml
122
+ [build-system]
123
+ requires = ["hatchling"]
124
+ build-backend = "hatchling.build"
125
+
126
+ [project]
127
+ name = "churn-model"
128
+ version = "0.1.0"
129
+ description = "Customer churn prediction — feature pipeline, training, and serving."
130
+ requires-python = ">=3.11"
131
+ dependencies = [
132
+ "pandas>=2.1",
133
+ "numpy>=1.26",
134
+ "scikit-learn>=1.4",
135
+ "pydantic>=2.5",
136
+ ]
137
+
138
+ [project.optional-dependencies]
139
+ dev = [
140
+ "ruff>=0.3",
141
+ "pytest>=8.0",
142
+ "pytest-cov>=4.1",
143
+ "ipykernel>=6.29",
144
+ ]
145
+
146
+ [tool.ruff]
147
+ line-length = 100
148
+ target-version = "py311"
149
+ # ... (see ruff section above)
150
+
151
+ [tool.pytest.ini_options]
152
+ testpaths = ["tests"]
153
+ addopts = "-ra --strict-markers --cov=src --cov-report=term-missing"
154
+ markers = [
155
+ "slow: marks tests as slow (deselect with '-m \"not slow\"')",
156
+ ]
157
+ ```
158
+
159
+ Repo layout:
160
+
161
+ ```
162
+ churn-model/
163
+   pyproject.toml
164
+   README.md
165
+   src/
166
+     churn_model/
167
+       __init__.py
168
+       data/        # loaders, schemas, splits
169
+       features/    # transformers, encoders, selection
170
+       models/      # model definitions and wrappers
171
+       training/    # train loops, CV runners
172
+       evaluation/  # metrics, diagnostics
173
+       serving/     # inference helpers
174
+   notebooks/
175
+     01_data_audit.ipynb
176
+     02_feature_exploration.ipynb
177
+   tests/
178
+     test_features.py
179
+     test_training.py
180
+   configs/
181
+     base.yaml
182
+ ```
183
+
184
+ Use a `src/` layout (not flat) so imports always go through the installed package — this prevents the "works in notebook, breaks in test" failure mode where `from my_module import x` resolves from the CWD instead of the package.
185
+
186
+ ### Import ordering
187
+
188
+ `ruff` with rule `I` enforces `isort`-compatible sections automatically. The contract:
189
+
190
+ 1. Future imports (`from __future__ import annotations`)
191
+ 2. Standard library
192
+ 3. Third-party
193
+ 4. First-party (your package)
194
+ 5. Local relative (`from .utils import ...`)
195
+
196
+ One blank line between sections, alphabetical within each. Do not hand-maintain this — `ruff check --fix` sorts imports in milliseconds.
197
+
198
+ ```python
199
+ from __future__ import annotations
200
+
201
+ import json
202
+ from pathlib import Path
203
+
204
+ import numpy as np
205
+ import pandas as pd
206
+ from sklearn.model_selection import KFold
207
+
208
+ from churn_model.data import load_raw
209
+ from churn_model.features import target_encode
210
+
211
+ from .utils import timed
212
+ ```
213
+
214
+ ### Naming and docstrings
215
+
216
+ Naming rubric (enforced by `ruff` rule `N`):
217
+
218
+ - **Modules/files**: `snake_case.py` (`feature_store.py`, not `FeatureStore.py`).
219
+ - **Functions/variables**: `snake_case` (`compute_auc`, `n_splits`).
220
+ - **Classes**: `PascalCase` (`ChurnDataset`, `TargetEncoder`).
221
+ - **Constants**: `UPPER_SNAKE_CASE` at module top level (`DEFAULT_SEED = 42`, `FEATURE_COLUMNS: tuple[str, ...] = (...)`).
222
+ - **Private**: single leading underscore (`_internal_helper`). Double underscore only when you specifically want name-mangling inside a class.
223
+ - **Type variables**: `PascalCase` with suffix (`ModelT = TypeVar("ModelT")`).
224
+ - **DataFrame matrices**: `X`, `y`, `X_train`, `y_test` are the one permitted uppercase exception. This is ML convention, and `ruff` can be told to allow it by ignoring `N806` (uppercase local variables) and `N803` (uppercase argument names) in model/training modules.
225
+
226
+ Docstring style sizing — match the cost of writing the docstring to the consumer:
227
+
228
+ - **Terse one-liner** for private helpers and obvious utilities. `"""Return the 95th percentile of non-null values."""` is enough.
229
+ - **Full Google-style** (Args/Returns/Raises) for any public function in `src/features/`, `src/models/`, or `src/serving/` — anything a teammate or future-you will call without opening the source. See the `target_encode` example above.
230
+ - **Module docstring** on every `src/` module: one sentence describing what lives there. Skip on `scripts/` and `notebooks/`.
231
+ - **Class docstring** covers the class contract; `__init__` args go in the class docstring, not a separate `__init__` docstring. (This is the Google convention; add `D107` to the `ignore` list if `ruff` flags the missing `__init__` docstring.)
232
+
233
+ Pick Google **or** NumPy style — not both — and set it in `[tool.ruff.lint.pydocstyle]`. Google is more compact and reads better in IDE hover; NumPy is better when you have long parameter descriptions with math. For solo DS, Google is the default recommendation.
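
For contrast with the Google-style `target_encode` docstring, here is a minimal sketch of the same smoothing formula documented in NumPy style. The function `smoothed_mean` is a hypothetical scalar helper written for illustration; it assumes `smoothing > 0`.

```python
def smoothed_mean(
    category_sum: float,
    category_count: int,
    global_mean: float,
    smoothing: float = 10.0,
) -> float:
    """Shrink one category's mean toward the global mean.

    Parameters
    ----------
    category_sum : float
        Sum of the target over rows in this category.
    category_count : int
        Number of rows in this category; may be zero.
    global_mean : float
        Mean of the target over all rows.
    smoothing : float, default 10.0
        Prior weight; must be positive. The category mean gets weight
        ``n / (n + smoothing)`` where ``n`` is `category_count`.

    Returns
    -------
    float
        Weighted blend of the category mean and `global_mean`.
    """
    weight = category_count / (category_count + smoothing)
    # An empty category has no mean of its own; fall back to the prior.
    category_mean = category_sum / category_count if category_count else global_mean
    return weight * category_mean + (1 - weight) * global_mean
```

The NumPy layout spends more vertical space per parameter, which is the tradeoff the paragraph above describes: roomier for long descriptions, heavier for short ones.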
@@ -0,0 +1,198 @@
1
+ ---
2
+ name: data-science-data-versioning
3
+ description: When and how to version data for reproducibility — size-based rule for choosing between git+Parquet, git-lfs, and DVC
4
+ topics: [data-science, data-versioning, dvc, parquet, git-lfs]
5
+ ---
6
+
7
+ If you can't answer "what data produced this result", you can't reproduce it.
8
+ A model trained on 2026-02-14's snapshot will drift from one trained today,
9
+ and without a versioning story you have no way to roll back, diff, or explain
10
+ the difference to a reviewer six months later.
11
+
12
+ Data versioning answers that question without blowing up the repo — the trick
13
+ is picking a tool proportional to the dataset size. The common failure mode
14
+ is over-engineering (wiring up DVC with a remote for a 40 MB CSV) or
15
+ under-engineering (committing 2 GB of Parquet directly into git and
16
+ discovering three months later that every clone takes twenty minutes).
17
+
18
+ ## Summary
19
+
20
+ Pick your tool by size.
21
+
22
+ - Under ~1 GB of text or Parquet, plain git with committed Parquet files is fine.
23
+ - Between 1 and 10 GB, use `git-lfs` if you already use it on the team;
24
+ otherwise invest in `DVC` (Data Version Control).
25
+ - Above 10 GB — or for any binary artifact (model weights, image corpora,
26
+ audio) — always use DVC with a remote (s3, gcs, azure).
27
+ - Never version raw third-party data that you can't legally redistribute —
28
+ store a fetch script and a content hash instead.
29
+
30
+ ## Deep Guidance
31
+
32
+ ### Size-based decision rule
33
+
34
+ | Dataset | Tool | Why |
35
+ |---------|------|-----|
36
+ | <1 GB text / Parquet | git + Parquet | Columnar compression keeps files small; git history stays sane |
37
+ | 1–10 GB (judgment call) | `git-lfs` if already adopted; otherwise DVC | LFS is lower-effort; DVC gives you pipeline stages too |
38
+ | >10 GB or binary artifacts | DVC with remote | Git history will not tolerate binary churn at this scale |
39
+ | Raw third-party data | Don't version — script + hash | Redistribution is often prohibited; raw bytes bloat history |
40
+
41
+ The sizes above are rules of thumb, not hard thresholds. What actually
42
+ matters is how often the data changes. A 5 GB file that you generate once
43
+ and never touch again can live in `git-lfs` forever without pain. The same
44
+ 5 GB file regenerated weekly will accumulate 260 GB of LFS storage in a
45
+ year — that's the point where DVC's content-addressed cache starts to earn
46
+ its complexity.
47
+
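The decision table and the storage arithmetic above can be written down as a small helper. This is a sketch under stated assumptions: the function names and return strings are hypothetical, the thresholds are the same rules of thumb (not hard limits), and `yearly_storage_gb` assumes every regeneration stores a full new copy.

```python
def pick_versioning_tool(
    size_gb: float,
    *,
    binary: bool = False,
    lfs_adopted: bool = False,
    redistributable: bool = True,
) -> str:
    """Map a dataset to a versioning approach using the size-based rule."""
    if not redistributable:
        return "fetch script + content hash"
    if binary or size_gb > 10:
        return "DVC with remote"
    if size_gb >= 1:
        return "git-lfs" if lfs_adopted else "DVC"
    return "git + Parquet"


def yearly_storage_gb(size_gb: float, regenerations_per_year: int) -> float:
    """Upper bound on stored bytes if each regeneration keeps a full copy."""
    return size_gb * regenerations_per_year


if __name__ == "__main__":
    print(pick_versioning_tool(0.04))                 # small CSV-sized data
    print(pick_versioning_tool(5, lfs_adopted=True))  # mid-size, LFS in place
    print(yearly_storage_gb(5, 52))                   # 5 GB regenerated weekly
```

The second function is the reason change frequency matters more than raw size: 5 GB times 52 weekly regenerations is the 260 GB figure quoted above.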
48
+ A second factor is team shape. A solo researcher on a laptop rarely needs a
49
+ remote backing store; a two-person team on different continents almost
50
+ always does. Choose the tool that fits the smallest real collaboration
51
+ pattern you have, not the one that scales to the team you imagine having.
52
+
53
+ ### When git + Parquet is enough
54
+
55
+ For a solo or small-team project with modest data, commit processed Parquet directly. Keep raw data out of the repo; reserve git for cleaned, analysis-ready files.
56
+
57
+ ```python
58
+ # src/pipelines/clean.py
59
+ import pandas as pd
60
+
61
+ df = pd.read_csv("data/raw/events.csv") # data/raw/ is gitignored
62
+ clean = df.dropna(subset=["user_id"]).assign(ts=lambda d: pd.to_datetime(d["ts"]))  # parse ts on the filtered rows
63
+ clean.to_parquet("data/interim/events_clean.parquet", compression="zstd")
64
+ ```
65
+
66
+ ```gitignore
67
+ # .gitignore: ignore raw inputs, keep processed Parquet tracked
68
+ data/raw/
69
+ data/external/
70
+ *.csv
71
+ !data/interim/*.parquet
72
+ ```
73
+
74
+ Parquet's columnar layout and zstd compression typically shrink tabular data
75
+ 5–10x versus CSV. Diffs aren't line-level but file-level content hashes are
76
+ stable, which is enough for "which version produced this model".
77
+
78
+ Pair the committed Parquet with a short data card — a markdown file in
79
+ `data/interim/events_clean.md` describing row count, schema, source, and the
80
+ commit that generated it — so readers of the repo a year later can tell what
81
+ they're looking at.
82
+
83
+ ### DVC basics
84
+
85
+ DVC treats large files as pointers tracked in git. The real bytes live on a remote (s3/gcs/azure/ssh), and a small `.dvc` metadata file is committed.
86
+
87
+ ```yaml
88
+ # dvc.yaml — pipeline stages with content-hashed inputs and outputs
89
+ stages:
90
+   ingest:
91
+     cmd: python src/ingest.py --out data/raw/events.parquet
92
+     outs:
93
+       - data/raw/events.parquet
94
+   process:
95
+     cmd: python src/process.py --in data/raw/events.parquet --out data/processed/features.parquet
96
+     deps:
97
+       - src/process.py
98
+       - data/raw/events.parquet
99
+     outs:
100
+       - data/processed/features.parquet
101
+ ```
102
+
103
+ Typical flow:
104
+
105
+ ```bash
106
+ dvc init # creates .dvc/ directory
107
+ dvc remote add -d storage s3://my-bucket/dvc-store
108
+ dvc add data/raw/big_dataset.csv # creates data/raw/big_dataset.csv.dvc (commit this)
109
+ dvc repro # runs stages whose inputs changed
110
+ dvc push # upload tracked files to remote
111
+ git add .dvc/config dvc.yaml dvc.lock data/raw/big_dataset.csv.dvc data/raw/.gitignore
112
+ git commit -m "track raw events via DVC"
113
+ ```
114
+
115
+ `dvc.lock` records the content hash of every stage input and output, so
116
+ `dvc repro` on a peer's machine rebuilds exactly what you rebuilt. The
117
+ `.dvc/` directory holds local cache and config; the actual bytes never touch
118
+ git.
119
+
120
+ Mental model: git for code, DVC for data, both pointing at the same commit.
121
+ When you check out an older branch, git restores the source and the `.dvc`
122
+ pointers, and `dvc checkout` restores the matching data from the local cache
123
+ into your working tree (run `dvc pull` first if the cache is missing it).
124
+
125
+ A common starting point: track one or two heavy inputs with `dvc add` (no
126
+ pipeline), and only adopt `dvc.yaml` stages once you have a repeatable
127
+ multi-step workflow. The overhead of stages pays off when you have 3+ steps
128
+ and want `dvc repro` to skip unchanged work; below that, plain `dvc add`
129
+ plus a Makefile is often clearer.
130
+
131
+ ### git-lfs middle ground
132
+
133
+ If you're already using Git LFS on the team but not ready to adopt DVC, it works acceptably for the 1–10 GB band — especially for a handful of files over the 100 MB GitHub push limit.
134
+
135
+ ```gitattributes
136
+ # .gitattributes
137
+ *.parquet filter=lfs diff=lfs merge=lfs -text
138
+ *.pkl filter=lfs diff=lfs merge=lfs -text
139
+ data/models/** filter=lfs diff=lfs merge=lfs -text
140
+ ```
141
+
142
+ ```bash
143
+ git lfs install
144
+ git lfs track "*.parquet"
145
+ git add .gitattributes data/features.parquet
146
+ git commit -m "add feature table via LFS"
147
+ ```
148
+
149
+ Reach for `git-lfs` when files are over ~100 MB but you don't need DVC's
150
+ pipeline stages or content-addressed reproducibility. Skip it if you already
151
+ have DVC set up — two tools versioning the same data is a recipe for
152
+ confusion.
153
+
154
+ LFS has real drawbacks to know about: bandwidth is metered on hosted plans,
155
+ `git clone` downloads every LFS object in the checked-out commit by default (use `GIT_LFS_SKIP_SMUDGE=1`
156
+ to defer), and you can't selectively prune history without rewriting the
157
+ whole repo. For a working group of 2–5 people on a research project these
158
+ are usually tolerable; for a fleet of CI workers cloning on every build
159
+ they are not.
160
+
161
+ ### What not to version
162
+
163
+ - **Third-party data with license constraints**: commit a fetch script (`scripts/fetch_kaggle.sh`) and record the SHA256 of the pulled file in a README. Re-download in each environment.
164
+ - **Regenerable intermediates**: if `dvc repro` or `make data` can recreate it deterministically from upstream inputs, don't commit the bytes.
165
+ - **Scratch / exploratory outputs**: `notebooks/scratch/`, `data/tmp/`, and `.ipynb_checkpoints/` belong in `.gitignore`.
166
+ - **Anti-pattern: committing 500 MB Parquet files directly to git** — they live forever in history, clone times balloon, and nobody will clean it up later. Move to DVC or LFS *before* the first large commit, not after. Rewriting history to extract large blobs (`git filter-repo`, BFG) is disruptive to every collaborator and should be a last resort.
167
+ - **Anti-pattern: versioning model checkpoints in git** — a single PyTorch checkpoint can be several hundred MB, and training runs produce dozens. Push them to DVC or an artifact store (MLflow, Weights & Biases) keyed by experiment run ID.
168
+
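For the "fetch script plus recorded SHA256" approach in the first bullet, the check on the downloaded file might look like the sketch below. The helper names `sha256_of` and `verify_download` are hypothetical; only the standard library is used, and the expected hash is whatever you recorded in the README.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file on disk does not match the recorded hash."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"{path}: expected {expected_sha256}, got {actual}")
```

Calling `verify_download` right after the fetch step turns a silently changed upstream file into a loud failure instead of a quietly different model.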
169
+ ### Quick migration path
170
+
171
+ If you're staring at a repo that has already committed large files to plain
172
+ git, the order of operations is:
173
+
174
+ 1. Decide the target tool (DVC for most cases where you got here).
175
+ 2. Run `git rm --cached <file>` to stop tracking it in git, then `dvc add <file>`
176
+ in its current location; `dvc add` creates a `.dvc` pointer and a `.gitignore` entry.
177
+ 3. Commit the pointer and the updated `.gitignore`.
178
+ 4. Optionally run `git filter-repo` to purge the old blobs from history if
179
+ clone size has become painful.
180
+
181
+ Step 4 requires coordination — everyone must re-clone — so defer it until
182
+ the pain justifies the disruption.
183
+
184
+ ### Reproducibility in practice
185
+
186
+ The goal of all of this is a single concrete question: given a git commit,
187
+ can a teammate rebuild the exact model artifact that the commit describes?
188
+ Answer yes by pinning three things together:
189
+
190
+ - **Code** — the git commit itself.
191
+ - **Data** — a `.dvc` pointer, an LFS object, or a committed Parquet file,
192
+ all content-hashed.
193
+ - **Environment** — a pinned `requirements.txt`, `pyproject.toml`, or
194
+ `conda-lock.yml` committed in the same commit.
195
+
196
+ If any one of those three is missing, reproducibility is accidental. The
197
+ versioning tool you pick is less important than treating the three as a
198
+ single atomic unit — changed together, reviewed together, reverted together.
@@ -0,0 +1,159 @@
1
+ ---
2
+ name: data-science-dev-environment
3
+ description: Reproducible local Python dev environment for data science using uv, direnv, pre-commit, and pyproject.toml
4
+ topics: [data-science, dev-environment, uv, direnv, pre-commit]
5
+ ---
6
+
7
+ A data-science project that cannot be recreated in minutes is a liability. Notebooks pick up stale package versions, secrets leak into `.bashrc`, and "works on my machine" kills any chance of a collaborator (or future-you) rerunning an experiment. The fix is not complicated: one lockfile, one place for env vars, one pre-commit hook, no bespoke shell scripts. This guide is opinionated toward solo and small-team workflows where local-first beats container-first.
8
+
9
+ ## Summary
10
+
11
+ Use `uv` as the single Python package manager: it replaces `pip`, `pip-tools`, `venv`, and `virtualenv` with one fast, reproducible tool. Declare every dependency in `pyproject.toml` so there is exactly one source of truth, and commit `uv.lock` so `uv sync` installs the same locked package versions for every collaborator. Layer `direnv` on top for per-repo environment variables (tracking URIs, data paths, secrets pulled from a vault) so nothing leaks into your global shell. Add `pre-commit` with a small set of fast hooks (`ruff` format and lint, end-of-file fixer) so style and obvious bugs never enter a commit. Skip Docker for greenfield solo DS work; reach for it only when you cross an OS boundary (Mac dev, Linux prod) or depend on GPU/CUDA libraries.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### uv for Python environment and dependencies
16
+
17
+ `uv` is the 2025 default for Python packaging. It is a drop-in replacement for `pip`, `venv`, and `pip-tools`, written in Rust, and roughly 10-100x faster than the tools it replaces. For data science the combination of `uv sync` (reproduces the environment from the lockfile) and `uv run` (executes a script in the managed venv without activation) is the whole workflow.
18
+
19
+ Bootstrap a new project:
20
+
21
+ ```bash
22
+ uv init --python 3.12 myproject
23
+ cd myproject
24
+ uv add pandas numpy scikit-learn jupyterlab pandera
25
+ uv add --dev ruff pytest pre-commit
26
+ uv sync # creates .venv and installs everything
27
+ uv run pytest # runs in the managed venv, no activation needed
28
+ uv run jupyter lab
29
+ ```
30
+
31
+ A minimal `pyproject.toml`:
32
+
33
+ ```toml
34
+ [project]
35
+ name = "myproject"
36
+ version = "0.1.0"
37
+ description = "Customer churn analysis"
38
+ requires-python = ">=3.12"
39
+ dependencies = [
40
+ "pandas>=2.2",
41
+ "numpy>=2.0",
42
+ "scikit-learn>=1.5",
43
+ "jupyterlab>=4.2",
44
+ "pandera>=0.20",
45
+ ]
46
+
47
+ [dependency-groups]
48
+ dev = [
49
+ "ruff>=0.6",
50
+ "pytest>=8.0",
51
+ "pre-commit>=3.8",
52
+ ]
53
+
54
+ [tool.ruff]
55
+ line-length = 100
56
+ target-version = "py312"
57
+
58
+ [tool.ruff.lint]
59
+ select = ["E", "F", "I", "B", "UP", "PD"] # PD = pandas-vet
60
+ ```
61
+
62
+ Commit both `pyproject.toml` and `uv.lock`. A collaborator clones the repo and runs `uv sync` — that is the entire setup step. No `pip install -r requirements.txt`, no virtualenv activation, no version drift.
63
+
64
+ Add a package with `uv add <name>`; remove with `uv remove <name>`. Both edit `pyproject.toml` and update `uv.lock` atomically. To pin a version use `uv add "pandas==2.2.3"`. To upgrade run `uv lock --upgrade-package pandas`.
65
+
66
+ Two `uv` features worth knowing for data science specifically:
67
+
68
+ - **`uv run script.py`** executes a file in the project venv with no activation step. Wire this into a `Makefile` or `justfile` so `make train` and `make eval` Just Work for any collaborator.
69
+ - **Inline script metadata (PEP 723).** For one-off analysis scripts that live outside the project, a comment block at the top of the file declares dependencies and `uv run` auto-creates an ephemeral venv:
70
+
71
+ ```python
72
+ # /// script
73
+ # requires-python = ">=3.12"
74
+ # dependencies = ["pandas", "duckdb"]
75
+ # ///
76
+ import pandas as pd
77
+ import duckdb
78
+ ...
79
+ ```
80
+
81
+ Running `uv run oneoff.py` resolves and caches those deps transparently. No more "should I add this to `pyproject.toml`?" for throwaway exploration.
82
+
83
+ ### direnv for env vars
84
+
85
+ `direnv` loads a per-directory `.envrc` file whenever you `cd` into the project. It keeps secrets and tracking URIs out of your global shell and ensures every terminal session sees the same variables. Skip it if your project has no environment variables; add it the first time you reach for one.
86
+
87
+ Install once (`brew install direnv`, then hook into your shell per the docs). In the project:
88
+
89
+ ```bash
90
+ # .envrc — commit this file
91
+ PATH_add .venv/bin # put the uv-managed venv first on PATH
92
+ source_up_if_exists # inherit vars from parent .envrc if present
93
+
94
+ # Experiment tracking
95
+ export MLFLOW_TRACKING_URI="http://localhost:5000"
96
+
97
+ # Data paths (relative to repo root)
98
+ export DATA_DIR="$PWD/data"
99
+ export MODELS_DIR="$PWD/models"
100
+
101
+ # Make imports work without installing the package
102
+ export PYTHONPATH="$PWD/src${PYTHONPATH:+:$PYTHONPATH}"
103
+
104
+ # Secrets — source from a local-only file, never commit
105
+ [[ -f .envrc.local ]] && source_env .envrc.local
106
+ ```
107
+
108
+ Add `.envrc.local` to `.gitignore` and put any actual secrets there (API keys, DB passwords). Run `direnv allow` once after creating or editing `.envrc`; `direnv` will refuse to load until you do. The moment you `cd` out of the project, all variables unload — no pollution.
109
+
110
+ ### pre-commit hooks
111
+
112
+ `pre-commit` runs a configured set of checks every time you `git commit`. Keep the hook list short and fast — anything slower than a second or two trains you to use `--no-verify`, which defeats the point. For data science the right starter set is format, lint, and a couple of sanity hooks.
113
+
114
+ ```yaml
115
+ # .pre-commit-config.yaml
116
+ repos:
117
+   - repo: https://github.com/astral-sh/ruff-pre-commit
118
+     rev: v0.6.9
119
+     hooks:
120
+       - id: ruff # lint hook id at this rev; run before the formatter when using --fix
121
+         args: [--fix]
122
+       - id: ruff-format
123
+
124
+   - repo: https://github.com/pre-commit/pre-commit-hooks
125
+     rev: v5.0.0
126
+     hooks:
127
+       - id: end-of-file-fixer
128
+       - id: trailing-whitespace
129
+       - id: check-yaml
130
+       - id: check-added-large-files
131
+         args: [--maxkb=500] # block accidental dataset commits
132
+
133
+   - repo: https://github.com/kynan/nbstripout
134
+     rev: 0.7.1
135
+     hooks:
136
+       - id: nbstripout # strips notebook outputs before commit
137
+ ```
138
+
139
+ Install the git hook once per clone:
140
+
141
+ ```bash
142
+ uv run pre-commit install
143
+ uv run pre-commit run --all-files # bootstrap: fix everything now
144
+ ```
145
+
146
+ Deliberately excluded from this set: `mypy`, `pytest`, and `bandit`. They are all valuable, but they are slow enough that they belong in CI, not in the commit path. Fast local, thorough remote.
147
+
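For a sense of what the `nbstripout` hook in the config above does, here is a simplified approximation in plain Python. It only clears outputs and execution counts from code cells in an nbformat v4 notebook; the real tool also handles selected metadata and other options, and `strip_outputs` is a hypothetical name.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a v4 notebook with code-cell outputs and counters cleared."""
    stripped = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped
```

A notebook file is JSON, so `strip_outputs(json.load(f))` followed by `json.dump` reproduces the core effect. That is why diffs stay small once the hook is installed: only cell sources change between commits.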
148
+ ### When to add Docker
149
+
150
+ For greenfield solo data science, Docker is overhead you do not need. `uv.lock` is a cross-platform lockfile, so `uv sync` already reproduces the same Python dependency versions on any machine; what it does not pin is system libraries and drivers. Local iteration is also faster without a container layer.
151
+
152
+ Reach for Docker only when one of these is true:
153
+
154
+ - **OS mismatch between dev and prod.** You develop on macOS but the model runs on Linux in production, and a native dependency (e.g. a C extension, a specific `libgomp`) behaves differently across platforms.
155
+ - **GPU / CUDA dependencies.** CUDA toolkit versions are tightly coupled to driver versions and OS. A pinned `nvidia/cuda` base image is the only sane way to guarantee training reproducibility across machines.
156
+ - **Handoff to MLOps or serving infra.** Production deployment targets (SageMaker, Vertex, KServe, plain Kubernetes) expect a container. Build one at the handoff boundary, not before.
157
+ - **Onboarding collaborators with hostile local setups.** A Windows colleague who cannot install `uv` natively is a reasonable reason to ship a devcontainer.
158
+
159
+ When you do add Docker, keep it thin: copy `pyproject.toml` and `uv.lock`, run `uv sync --frozen`, and let the same lockfile drive both local and container builds. That way the container is a packaging detail, not a parallel source of truth.