@zigrivers/scaffold 3.22.0 → 3.24.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +44 -23
- package/content/knowledge/core/automated-review-tooling.md +3 -3
- package/content/knowledge/core/multi-model-review-dispatch.md +13 -4
- package/content/knowledge/data-science/README.md +23 -0
- package/content/knowledge/data-science/data-science-architecture.md +163 -0
- package/content/knowledge/data-science/data-science-conventions.md +233 -0
- package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
- package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
- package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
- package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
- package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
- package/content/knowledge/data-science/data-science-observability.md +161 -0
- package/content/knowledge/data-science/data-science-project-structure.md +178 -0
- package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
- package/content/knowledge/data-science/data-science-requirements.md +151 -0
- package/content/knowledge/data-science/data-science-security.md +151 -0
- package/content/knowledge/data-science/data-science-testing.md +183 -0
- package/content/knowledge/ml/README.md +10 -0
- package/content/methodology/data-science-overlay.yml +39 -0
- package/content/pipeline/build/multi-agent-resume.md +7 -6
- package/content/pipeline/build/multi-agent-start.md +7 -6
- package/content/pipeline/build/single-agent-resume.md +7 -6
- package/content/pipeline/build/single-agent-start.md +7 -6
- package/content/pipeline/environment/automated-pr-review.md +79 -27
- package/content/skills/mmr/SKILL.md +72 -2
- package/content/skills/scaffold-runner/SKILL.md +65 -19
- package/content/tools/review-code.md +74 -16
- package/content/tools/review-pr.md +25 -6
- package/dist/cli/commands/check.d.ts.map +1 -1
- package/dist/cli/commands/check.js +28 -17
- package/dist/cli/commands/check.js.map +1 -1
- package/dist/config/schema.d.ts +672 -126
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +8 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +2 -2
- package/dist/config/schema.test.js.map +1 -1
- package/dist/config/validators/data-science.d.ts +4 -0
- package/dist/config/validators/data-science.d.ts.map +1 -0
- package/dist/config/validators/data-science.js +15 -0
- package/dist/config/validators/data-science.js.map +1 -0
- package/dist/config/validators/index.d.ts.map +1 -1
- package/dist/config/validators/index.js +2 -0
- package/dist/config/validators/index.js.map +1 -1
- package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
- package/dist/core/assembly/knowledge-loader.js +6 -0
- package/dist/core/assembly/knowledge-loader.js.map +1 -1
- package/dist/core/assembly/knowledge-loader.test.js +34 -0
- package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.js +73 -0
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/project/adopt.d.ts.map +1 -1
- package/dist/project/adopt.js +3 -1
- package/dist/project/adopt.js.map +1 -1
- package/dist/project/detectors/coverage.test.d.ts +2 -0
- package/dist/project/detectors/coverage.test.d.ts.map +1 -0
- package/dist/project/detectors/coverage.test.js +78 -0
- package/dist/project/detectors/coverage.test.js.map +1 -0
- package/dist/project/detectors/data-science.d.ts +4 -0
- package/dist/project/detectors/data-science.d.ts.map +1 -0
- package/dist/project/detectors/data-science.js +32 -0
- package/dist/project/detectors/data-science.js.map +1 -0
- package/dist/project/detectors/data-science.test.d.ts +2 -0
- package/dist/project/detectors/data-science.test.d.ts.map +1 -0
- package/dist/project/detectors/data-science.test.js +62 -0
- package/dist/project/detectors/data-science.test.js.map +1 -0
- package/dist/project/detectors/disambiguate.d.ts +2 -0
- package/dist/project/detectors/disambiguate.d.ts.map +1 -1
- package/dist/project/detectors/disambiguate.js +3 -2
- package/dist/project/detectors/disambiguate.js.map +1 -1
- package/dist/project/detectors/disambiguate.test.js +10 -1
- package/dist/project/detectors/disambiguate.test.js.map +1 -1
- package/dist/project/detectors/index.d.ts.map +1 -1
- package/dist/project/detectors/index.js +2 -0
- package/dist/project/detectors/index.js.map +1 -1
- package/dist/project/detectors/library.d.ts.map +1 -1
- package/dist/project/detectors/library.js +1 -0
- package/dist/project/detectors/library.js.map +1 -1
- package/dist/project/detectors/resolve-detection.test.js +31 -0
- package/dist/project/detectors/resolve-detection.test.js.map +1 -1
- package/dist/project/detectors/types.d.ts +6 -2
- package/dist/project/detectors/types.d.ts.map +1 -1
- package/dist/project/detectors/types.js.map +1 -1
- package/dist/types/config.d.ts +8 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/copy/core.d.ts.map +1 -1
- package/dist/wizard/copy/core.js +4 -0
- package/dist/wizard/copy/core.js.map +1 -1
- package/dist/wizard/copy/data-science.d.ts +3 -0
- package/dist/wizard/copy/data-science.d.ts.map +1 -0
- package/dist/wizard/copy/data-science.js +15 -0
- package/dist/wizard/copy/data-science.js.map +1 -0
- package/dist/wizard/copy/index.d.ts.map +1 -1
- package/dist/wizard/copy/index.js +2 -0
- package/dist/wizard/copy/index.js.map +1 -1
- package/dist/wizard/copy/types.d.ts +5 -1
- package/dist/wizard/copy/types.d.ts.map +1 -1
- package/dist/wizard/copy/types.test-d.js +7 -0
- package/dist/wizard/copy/types.test-d.js.map +1 -1
- package/dist/wizard/questions.d.ts +2 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +9 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +14 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +1 -0
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
- package/skills/mmr/SKILL.md +72 -2
- package/skills/scaffold-runner/SKILL.md +65 -19
|
@@ -0,0 +1,233 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-science-conventions
|
|
3
|
+
description: Python coding conventions for solo data-science work — ruff for lint+format, pragmatic type hints, pyproject.toml as single config source, import ordering, module layout, naming, and docstrings
|
|
4
|
+
topics: [data-science, conventions, python, ruff, type-hints]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Solo data-science code drifts faster than any other kind of Python: half of it lives in notebooks, the other half migrates into scripts, and nothing stays stable long enough to earn a style review. Consistent conventions are the only thing that keeps cognitive load bounded when you come back to a project after two months. Encode them in tooling (`ruff`, `pyproject.toml`) so they run on save — not on willpower — and the notebook→script promotion path stays smooth instead of becoming a cleanup tax.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Use `ruff` as the single lint + format tool: `ruff format` is Black-compatible and replaces Black, so do not install both. Apply type hints pragmatically: annotate any function another module imports, and skip annotations on throwaway notebook helpers. Centralize all project and tool configuration in `pyproject.toml`, one file for build metadata, dependencies, ruff, and pytest. Use `isort`-style import sections enforced by `ruff` (stdlib → third-party → local), a `src/` layout with a clear module split, and docstrings sized to the consumer: one-liners for internal helpers, full Google or NumPy style for anything a teammate will call without reading the source.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Linter + formatter (ruff)
|
|
16
|
+
|
|
17
|
+
`ruff` is the only Python linter/formatter a solo DS project needs. It replaces `flake8`, `isort`, `pyupgrade`, `pydocstyle`, `pylint` (mostly), and — via `ruff format` — Black. It is an order of magnitude faster than the tools it replaces, configured in one `[tool.ruff]` block, and has no plugin-management overhead. Do not layer Black on top: `ruff format` implements the same formatting contract, and running both just causes churn.
|
|
18
|
+
|
|
19
|
+
```toml
|
|
20
|
+
# pyproject.toml
|
|
21
|
+
[tool.ruff]
|
|
22
|
+
line-length = 100
|
|
23
|
+
target-version = "py311"
|
|
24
|
+
extend-exclude = ["notebooks/_scratch", "data", "models"]
|
|
25
|
+
|
|
26
|
+
[tool.ruff.lint]
|
|
27
|
+
select = [
|
|
28
|
+
"E", # pycodestyle errors
|
|
29
|
+
"W", # pycodestyle warnings
|
|
30
|
+
"F", # pyflakes
|
|
31
|
+
"I", # isort (import sorting)
|
|
32
|
+
"N", # pep8-naming
|
|
33
|
+
"UP", # pyupgrade
|
|
34
|
+
"B", # flake8-bugbear
|
|
35
|
+
"C90", # mccabe complexity
|
|
36
|
+
"D", # pydocstyle
|
|
37
|
+
]
|
|
38
|
+
ignore = [
|
|
39
|
+
"D100", # missing docstring in public module — noisy for scripts
|
|
40
|
+
"D104", # missing docstring in public package
|
|
41
|
+
"E501", # line-too-long — formatter handles it
|
|
42
|
+
]
|
|
43
|
+
|
|
44
|
+
[tool.ruff.lint.per-file-ignores]
|
|
45
|
+
# Notebooks and experiment scripts get a lighter hand
|
|
46
|
+
"notebooks/**/*.py" = ["D", "N806", "E402"]
|
|
47
|
+
"scripts/**/*.py" = ["D"]
|
|
48
|
+
"tests/**/*.py" = ["D"]
|
|
49
|
+
|
|
50
|
+
[tool.ruff.lint.pydocstyle]
|
|
51
|
+
convention = "google"
|
|
52
|
+
|
|
53
|
+
[tool.ruff.lint.mccabe]
|
|
54
|
+
max-complexity = 12
|
|
55
|
+
|
|
56
|
+
[tool.ruff.format]
|
|
57
|
+
quote-style = "double"
|
|
58
|
+
indent-style = "space"
|
|
59
|
+
```
|
|
60
|
+
|
|
61
|
+
**Tradeoff**: notebook and exploration code legitimately breaks rules that production code should not — uppercase variable names (`X_train`), imports after executable code, no docstrings. The `per-file-ignores` block disables the rules that fight notebook workflows without weakening the defaults for `src/`. Do not globally ignore `D` or `N` just to silence notebook noise.
|
|
62
|
+
|
|
63
|
+
Run `ruff` on save (editor integration) and as a pre-commit hook. In CI, run `ruff check .` and `ruff format --check .`; the `--check` flag makes the formatter fail on unformatted files instead of rewriting them.
|
|
64
|
+
|
|
65
|
+
### Type hints
|
|
66
|
+
|
|
67
|
+
Python does not enforce type hints at runtime, and annotating every line of exploratory code wastes time. The rule is **import boundary = type boundary**: if another module imports the function, type it. Notebook-local helpers and inline lambdas do not need annotations.
|
|
68
|
+
|
|
69
|
+
```python
# src/features/encoders.py: imported by training and serving, fully typed
from __future__ import annotations

import numpy as np
import pandas as pd


def target_encode(
    series: pd.Series,
    target: pd.Series,
    smoothing: float = 10.0,
) -> pd.Series:
    """Smoothed target encoding for a categorical feature.

    Args:
        series: Categorical feature values (any hashable dtype).
        target: Numeric target aligned to `series` by index.
        smoothing: Prior weight; higher values pull rare categories
            toward the global mean.

    Returns:
        Series of encoded floats aligned to `series.index`.
    """
    global_mean = target.mean()
    agg = target.groupby(series).agg(["mean", "count"])
    weight = agg["count"] / (agg["count"] + smoothing)
    encoding = weight * agg["mean"] + (1 - weight) * global_mean
    return series.map(encoding).astype(np.float64)
```
|
|
99
|
+
|
|
100
|
+
```python
# notebooks/03_eda.py: throwaway scratch, no annotations needed
def quick_hist(col):
    return df[col].value_counts().head(20)

for c in cat_cols:
    print(c, quick_hist(c).to_dict())
```
|
|
108
|
+
|
|
109
|
+
Practical rules:
|
|
110
|
+
- Type every function exported from `src/` — parameters and return.
|
|
111
|
+
- Type dataclasses and `TypedDict` schemas that describe data contracts (row shapes, config dicts).
|
|
112
|
+
- Skip annotations on notebook cells, inline closures, and private helpers inside a single script.
|
|
113
|
+
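- Use `from __future__ import annotations` at the top of every `src/` file. It stores annotations as strings instead of evaluating them when the function is defined, so forward references work without quoting. It does not skip imports by itself: to keep an expensive-to-import type (`torch.Tensor`, `pd.DataFrame`) from costing anything at import time, also move that import under `if TYPE_CHECKING:`.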
|
|
114
|
+
- Do not run `mypy --strict` on a solo DS project. Run it on `src/` with `--ignore-missing-imports` if you want a safety net, and do not bother with notebooks.
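The rule above about typing the schemas that describe data contracts could look like the following minimal sketch. The names `TrainConfig`, `EventRow`, and `parse_event`, along with their fields, are hypothetical and only illustrate the shape of such contracts.

```python
from dataclasses import dataclass
from typing import TypedDict


class TrainConfig(TypedDict):
    """Shape of a training config dict (hypothetical fields)."""

    seed: int
    n_splits: int
    feature_columns: list[str]


@dataclass(frozen=True)
class EventRow:
    """One cleaned event row (hypothetical schema)."""

    user_id: str
    ts: str
    amount: float


def parse_event(raw: dict[str, str]) -> EventRow:
    """Convert a raw string-valued record into a typed row."""
    return EventRow(
        user_id=raw["user_id"],
        ts=raw["ts"],
        amount=float(raw["amount"]),
    )


# A type checker can now flag a missing key or a wrongly typed field.
config: TrainConfig = {"seed": 42, "n_splits": 5, "feature_columns": ["amount"]}
row = parse_event({"user_id": "u1", "ts": "2026-02-14", "amount": "9.5"})
```

A `TypedDict` suits config dicts that stay plain dictionaries at runtime, while a frozen dataclass suits row-like records that should not be mutated after parsing.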
|
|
115
|
+
|
|
116
|
+
### Project layout and pyproject.toml
|
|
117
|
+
|
|
118
|
+
One `pyproject.toml` at the repo root configures the build, dependencies, lint, format, and tests. Do not scatter config across `setup.cfg`, `.flake8`, `.isort.cfg`, and `pytest.ini` — everything lives in `pyproject.toml`.
|
|
119
|
+
|
|
120
|
+
```toml
|
|
121
|
+
# pyproject.toml
|
|
122
|
+
[build-system]
|
|
123
|
+
requires = ["hatchling"]
|
|
124
|
+
build-backend = "hatchling.build"
|
|
125
|
+
|
|
126
|
+
[project]
|
|
127
|
+
name = "churn-model"
|
|
128
|
+
version = "0.1.0"
|
|
129
|
+
description = "Customer churn prediction — feature pipeline, training, and serving."
|
|
130
|
+
requires-python = ">=3.11"
|
|
131
|
+
dependencies = [
|
|
132
|
+
"pandas>=2.1",
|
|
133
|
+
"numpy>=1.26",
|
|
134
|
+
"scikit-learn>=1.4",
|
|
135
|
+
"pydantic>=2.5",
|
|
136
|
+
]
|
|
137
|
+
|
|
138
|
+
[project.optional-dependencies]
|
|
139
|
+
dev = [
|
|
140
|
+
"ruff>=0.3",
|
|
141
|
+
"pytest>=8.0",
|
|
142
|
+
"pytest-cov>=4.1",
|
|
143
|
+
"ipykernel>=6.29",
|
|
144
|
+
]
|
|
145
|
+
|
|
146
|
+
[tool.ruff]
|
|
147
|
+
line-length = 100
|
|
148
|
+
target-version = "py311"
|
|
149
|
+
# ... (see ruff section above)
|
|
150
|
+
|
|
151
|
+
[tool.pytest.ini_options]
|
|
152
|
+
testpaths = ["tests"]
|
|
153
|
+
addopts = "-ra --strict-markers --cov=src --cov-report=term-missing"
|
|
154
|
+
markers = [
|
|
155
|
+
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
|
|
156
|
+
]
|
|
157
|
+
```
|
|
158
|
+
|
|
159
|
+
Repo layout:
|
|
160
|
+
|
|
161
|
+
```
churn-model/
  pyproject.toml
  README.md
  src/
    churn_model/
      __init__.py
      data/          # loaders, schemas, splits
      features/      # transformers, encoders, selection
      models/        # model definitions and wrappers
      training/      # train loops, CV runners
      evaluation/    # metrics, diagnostics
      serving/       # inference helpers
  notebooks/
    01_data_audit.ipynb
    02_feature_exploration.ipynb
  tests/
    test_features.py
    test_training.py
  configs/
    base.yaml
```
|
|
183
|
+
|
|
184
|
+
Use a `src/` layout (not flat) so imports always go through the installed package — this prevents the "works in notebook, breaks in test" failure mode where `from my_module import x` resolves from the CWD instead of the package.
|
|
185
|
+
|
|
186
|
+
### Import ordering
|
|
187
|
+
|
|
188
|
+
`ruff` with rule `I` enforces `isort`-compatible sections automatically. The contract:
|
|
189
|
+
|
|
190
|
+
1. Future imports (`from __future__ import annotations`)
|
|
191
|
+
2. Standard library
|
|
192
|
+
3. Third-party
|
|
193
|
+
4. First-party (your package)
|
|
194
|
+
5. Local relative (`from .utils import ...`)
|
|
195
|
+
|
|
196
|
+
One blank line between sections, alphabetical within each. Do not hand-maintain this — `ruff check --fix` sorts imports in milliseconds.
|
|
197
|
+
|
|
198
|
+
```python
|
|
199
|
+
from __future__ import annotations
|
|
200
|
+
|
|
201
|
+
import json
|
|
202
|
+
from pathlib import Path
|
|
203
|
+
|
|
204
|
+
import numpy as np
|
|
205
|
+
import pandas as pd
|
|
206
|
+
from sklearn.model_selection import KFold
|
|
207
|
+
|
|
208
|
+
from churn_model.data import load_raw
|
|
209
|
+
from churn_model.features import target_encode
|
|
210
|
+
|
|
211
|
+
from .utils import timed
|
|
212
|
+
```
|
|
213
|
+
|
|
214
|
+
### Naming and docstrings
|
|
215
|
+
|
|
216
|
+
Naming rubric (enforced by `ruff` rule `N`):
|
|
217
|
+
|
|
218
|
+
- **Modules/files**: `snake_case.py` (`feature_store.py`, not `FeatureStore.py`).
|
|
219
|
+
- **Functions/variables**: `snake_case` (`compute_auc`, `n_splits`).
|
|
220
|
+
- **Classes**: `PascalCase` (`ChurnDataset`, `TargetEncoder`).
|
|
221
|
+
- **Constants**: `UPPER_SNAKE_CASE` at module top level (`DEFAULT_SEED = 42`, `FEATURE_COLUMNS: tuple[str, ...] = (...)`).
|
|
222
|
+
- **Private**: single leading underscore (`_internal_helper`). Double underscore only when you specifically want name-mangling inside a class.
|
|
223
|
+
- **Type variables**: `PascalCase` with a `T` suffix (`ModelT = TypeVar("ModelT")`).
|
|
224
|
+
- **DataFrame matrices**: `X`, `y`, `X_train`, `y_test` are the one permitted uppercase exception. This is ML convention, and `ruff` can be told to allow it by adding `N806` (uppercase variable inside a function) and, for arguments such as `X`, `N803` to `per-file-ignores` for model/training modules.
|
|
225
|
+
|
|
226
|
+
Docstring style sizing — match the cost of writing the docstring to the consumer:
|
|
227
|
+
|
|
228
|
+
- **Terse one-liner** for private helpers and obvious utilities. `"""Return the 95th percentile of non-null values."""` is enough.
|
|
229
|
+
- **Full Google-style** (Args/Returns/Raises) for any public function in `src/features/`, `src/models/`, or `src/serving/` — anything a teammate or future-you will call without opening the source. See the `target_encode` example above.
|
|
230
|
+
- **Module docstring** on every `src/` module: one sentence describing what lives there. Skip on `scripts/` and `notebooks/`.
|
|
231
|
+
- **Class docstring** covers the class contract; `__init__` args go in the class docstring, not a separate `__init__` docstring. (With `D` selected, `ruff` may still flag an undocumented `__init__` under rule `D107`; add `D107` to `ignore` if you follow this convention.)
|
|
232
|
+
|
|
233
|
+
Pick Google **or** NumPy style — not both — and set it in `[tool.ruff.lint.pydocstyle]`. Google is more compact and reads better in IDE hover; NumPy is better when you have long parameter descriptions with math. For solo DS, Google is the default recommendation.
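As a hedged illustration of the naming rubric and docstring sizing together, here is a small self-contained sketch. `DEFAULT_QUANTILE`, `_drop_missing`, and `QuantileClipper` are hypothetical names invented for this example; the nearest-rank quantile is a simplification chosen to keep it dependency-free.

```python
import math
from typing import Optional

DEFAULT_QUANTILE = 0.95  # module-level constant: UPPER_SNAKE_CASE


def _drop_missing(values: list[Optional[float]]) -> list[float]:
    """Return the non-null values in their original order."""
    return [v for v in values if v is not None]


class QuantileClipper:
    """Clip values above an upper quantile estimated from reference data.

    Args:
        quantile: Upper quantile in (0, 1] used as the clipping threshold.

    Raises:
        ValueError: If `quantile` is outside (0, 1].
    """

    def __init__(self, quantile: float = DEFAULT_QUANTILE) -> None:
        if not 0 < quantile <= 1:
            raise ValueError(f"quantile must be in (0, 1], got {quantile}")
        self.quantile = quantile
        self.threshold: Optional[float] = None

    def fit(self, values: list[Optional[float]]) -> "QuantileClipper":
        """Estimate the threshold with the nearest-rank method.

        Args:
            values: Reference values; `None` entries are ignored.

        Returns:
            The fitted clipper, to allow call chaining.

        Raises:
            ValueError: If no non-null values are provided.
        """
        clean = sorted(_drop_missing(values))
        if not clean:
            raise ValueError("cannot fit on empty or all-null input")
        rank = max(1, math.ceil(self.quantile * len(clean)))
        self.threshold = clean[rank - 1]
        return self

    def transform(self, values: list[float]) -> list[float]:
        """Return `values` with entries above the threshold clipped.

        Args:
            values: Values to clip.

        Returns:
            A new list with every entry capped at the fitted threshold.

        Raises:
            RuntimeError: If `fit` has not been called.
        """
        if self.threshold is None:
            raise RuntimeError("call fit() before transform()")
        return [min(v, self.threshold) for v in values]
```

The private helper gets a one-line docstring, the constructor arguments live in the class docstring, and the public methods carry full Args/Returns/Raises sections.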
|
|
@@ -0,0 +1,198 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-science-data-versioning
|
|
3
|
+
description: When and how to version data for reproducibility — size-based rule for choosing between git+Parquet, git-lfs, and DVC
|
|
4
|
+
topics: [data-science, data-versioning, dvc, parquet, git-lfs]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
If you can't answer "what data produced this result", you can't reproduce it.
|
|
8
|
+
A model trained on 2026-02-14's snapshot will drift from one trained today,
|
|
9
|
+
and without a versioning story you have no way to roll back, diff, or explain
|
|
10
|
+
the difference to a reviewer six months later.
|
|
11
|
+
|
|
12
|
+
Data versioning answers that question without blowing up the repo — the trick
|
|
13
|
+
is picking a tool proportional to the dataset size. The common failure mode
|
|
14
|
+
is over-engineering (wiring up DVC with a remote for a 40 MB CSV) or
|
|
15
|
+
under-engineering (committing 2 GB of Parquet directly into git and
|
|
16
|
+
discovering three months later that every clone takes twenty minutes).
|
|
17
|
+
|
|
18
|
+
## Summary
|
|
19
|
+
|
|
20
|
+
Pick your tool by size.
|
|
21
|
+
|
|
22
|
+
- Under ~1 GB of text or Parquet, plain git with committed Parquet files is fine.
|
|
23
|
+
- Between 1 and 10 GB, use `git-lfs` if you already use it on the team;
|
|
24
|
+
otherwise invest in `DVC` (Data Version Control).
|
|
25
|
+
- Above 10 GB — or for any binary artifact (model weights, image corpora,
|
|
26
|
+
audio) — always use DVC with a remote (s3, gcs, azure).
|
|
27
|
+
- Never version raw third-party data that you can't legally redistribute —
|
|
28
|
+
store a fetch script and a content hash instead.
|
|
29
|
+
|
|
30
|
+
## Deep Guidance
|
|
31
|
+
|
|
32
|
+
### Size-based decision rule
|
|
33
|
+
|
|
34
|
+
| Dataset | Tool | Why |
|
|
35
|
+
|---------|------|-----|
|
|
36
|
+
| <1 GB text / Parquet | git + Parquet | Columnar compression keeps files small; git history stays sane |
|
|
37
|
+
| 1–10 GB (judgment call) | `git-lfs` if already adopted; otherwise DVC | LFS is lower-effort; DVC gives you pipeline stages too |
|
|
38
|
+
| >10 GB or binary artifacts | DVC with remote | Git history will not tolerate binary churn at this scale |
|
|
39
|
+
| Raw third-party data | Don't version — script + hash | Redistribution is often prohibited; raw bytes bloat history |
|
|
40
|
+
|
|
41
|
+
The sizes above are rules of thumb, not hard thresholds. What actually
|
|
42
|
+
matters is how often the data changes. A 5 GB file that you generate once
|
|
43
|
+
and never touch again can live in `git-lfs` forever without pain. The same
|
|
44
|
+
5 GB file regenerated weekly will accumulate 260 GB of LFS storage in a
|
|
45
|
+
year — that's the point where DVC's content-addressed cache starts to earn
|
|
46
|
+
its complexity.
|
|
47
|
+
|
|
48
|
+
A second factor is team shape. A solo researcher on a laptop rarely needs a
|
|
49
|
+
remote backing store; a two-person team on different continents almost
|
|
50
|
+
always does. Choose the tool that fits the smallest real collaboration
|
|
51
|
+
pattern you have, not the one that scales to the team you imagine having.
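The decision table and the storage-growth arithmetic above can be sketched as two small functions. This is only an illustration: the function names and parameters are hypothetical, and the 1 GB and 10 GB cut-offs are the rules of thumb from the table, not hard limits.

```python
def choose_versioning_tool(
    size_gb: float,
    *,
    is_binary: bool = False,
    redistributable: bool = True,
    team_uses_lfs: bool = False,
) -> str:
    """Map dataset traits to the tool suggested by the size-based table."""
    if not redistributable:
        # Raw third-party data: keep a fetch script and a hash, not the bytes.
        return "fetch script + content hash"
    if is_binary or size_gb > 10:
        return "DVC with remote"
    if size_gb >= 1:
        return "git-lfs" if team_uses_lfs else "DVC"
    return "git + Parquet"


def yearly_storage_gb(size_gb: float, versions_per_year: int) -> float:
    """Storage consumed when every regenerated version is retained in full."""
    return size_gb * versions_per_year


print(choose_versioning_tool(0.04))                    # small CSV-sized data
print(choose_versioning_tool(5, team_uses_lfs=True))   # mid band, LFS already adopted
print(yearly_storage_gb(5, 52))                        # 5 GB regenerated weekly
```

Change frequency is deliberately a separate calculation: the same 5 GB file is harmless when written once and expensive when rewritten weekly.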
|
|
52
|
+
|
|
53
|
+
### When git + Parquet is enough
|
|
54
|
+
|
|
55
|
+
For a solo or small-team project with modest data, commit processed Parquet directly. Keep raw data out of the repo; reserve git for cleaned, analysis-ready files.
|
|
56
|
+
|
|
57
|
+
```python
|
|
58
|
+
# src/pipelines/clean.py
|
|
59
|
+
import pandas as pd
|
|
60
|
+
|
|
61
|
+
df = pd.read_csv("data/raw/events.csv") # data/raw/ is gitignored
|
|
62
|
+
clean = df.dropna(subset=["user_id"]).assign(ts=lambda d: pd.to_datetime(d["ts"]))
|
|
63
|
+
clean.to_parquet("data/interim/events_clean.parquet", compression="zstd")
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
```gitignore
|
|
67
|
+
# .gitignore
|
|
68
|
+
data/raw/
|
|
69
|
+
data/external/
|
|
70
|
+
*.csv
|
|
71
|
+
# Do commit processed Parquet (gitignore comments must sit on their own line)
!data/interim/*.parquet
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
Parquet's columnar layout and zstd compression typically shrink tabular data
|
|
75
|
+
5–10x versus CSV. Diffs aren't line-level but file-level content hashes are
|
|
76
|
+
stable, which is enough for "which version produced this model".
|
|
77
|
+
|
|
78
|
+
Pair the committed Parquet with a short data card — a markdown file in
|
|
79
|
+
`data/interim/events_clean.md` describing row count, schema, source, and the
|
|
80
|
+
commit that generated it — so readers of the repo a year later can tell what
|
|
81
|
+
they're looking at.
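A data card like the one described above can be generated rather than hand-written. The sketch below is one possible shape, not a standard format; `render_data_card` and its arguments are hypothetical, and the row count shown is a placeholder.

```python
def render_data_card(
    name: str,
    row_count: int,
    schema: dict[str, str],
    source: str,
    commit: str,
) -> str:
    """Render a short markdown data card for a committed dataset."""
    lines = [
        f"# {name}",
        "",
        f"- Rows: {row_count}",
        f"- Source: {source}",
        f"- Generated by commit: {commit}",
        "",
        "| Column | Dtype |",
        "|--------|-------|",
    ]
    lines += [f"| {col} | {dtype} |" for col, dtype in schema.items()]
    return "\n".join(lines) + "\n"


card = render_data_card(
    name="events_clean.parquet",
    row_count=120_000,  # placeholder value for illustration
    schema={"user_id": "string", "ts": "datetime64[ns]"},
    source="data/raw/events.csv",
    commit="<git commit hash>",
)
print(card)
```

With pandas available, `row_count` could come from `len(df)` and `schema` from `df.dtypes.astype(str).to_dict()`, so the card is written by the same script that writes the Parquet file.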
|
|
82
|
+
|
|
83
|
+
### DVC basics
|
|
84
|
+
|
|
85
|
+
DVC treats large files as pointers tracked in git. The real bytes live on a remote (s3/gcs/azure/ssh), and a small `.dvc` metadata file is committed.
|
|
86
|
+
|
|
87
|
+
```yaml
# dvc.yaml: pipeline stages with content-hashed inputs and outputs
stages:
  ingest:
    cmd: python src/ingest.py --out data/raw/events.parquet
    outs:
      - data/raw/events.parquet
  process:
    cmd: python src/process.py --in data/raw/events.parquet --out data/processed/features.parquet
    deps:
      - src/process.py
      - data/raw/events.parquet
    outs:
      - data/processed/features.parquet
```
|
|
102
|
+
|
|
103
|
+
Typical flow:
|
|
104
|
+
|
|
105
|
+
```bash
|
|
106
|
+
dvc init # creates .dvc/ directory
|
|
107
|
+
dvc remote add -d storage s3://my-bucket/dvc-store
|
|
108
|
+
dvc add data/raw/big_dataset.csv # creates data/raw/big_dataset.csv.dvc (commit this)
|
|
109
|
+
dvc repro # runs stages whose inputs changed
|
|
110
|
+
dvc push # upload tracked files to remote
|
|
111
|
+
git add dvc.yaml dvc.lock data/raw/big_dataset.csv.dvc data/raw/.gitignore
|
|
112
|
+
git commit -m "track raw events via DVC"
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
`dvc.lock` records the content hash of every stage input and output, so
|
|
116
|
+
`dvc repro` on a peer's machine rebuilds exactly what you rebuilt. The
|
|
117
|
+
`.dvc/` directory holds local cache and config; the actual bytes never touch
|
|
118
|
+
git.
|
|
119
|
+
|
|
120
|
+
Mental model: git for code, DVC for data, both pointing at the same commit.
|
|
121
|
+
When you check out an older branch, git restores the source and the `.dvc`
|
|
122
|
+
pointers, and `dvc checkout` restores the matching data from the local cache (run `dvc pull` first to fetch anything missing from the remote) into your
|
|
123
|
+
working tree.
|
|
124
|
+
|
|
125
|
+
A common starting point: track one or two heavy inputs with `dvc add` (no
|
|
126
|
+
pipeline), and only adopt `dvc.yaml` stages once you have a repeatable
|
|
127
|
+
multi-step workflow. The overhead of stages pays off when you have 3+ steps
|
|
128
|
+
and want `dvc repro` to skip unchanged work; below that, plain `dvc add`
|
|
129
|
+
plus a Makefile is often clearer.
|
|
130
|
+
|
|
131
|
+
### git-lfs middle ground
|
|
132
|
+
|
|
133
|
+
If you're already using Git LFS on the team but not ready to adopt DVC, it works acceptably for the 1–10 GB band — especially for a handful of files over the 100 MB GitHub push limit.
|
|
134
|
+
|
|
135
|
+
```gitattributes
|
|
136
|
+
# .gitattributes
|
|
137
|
+
*.parquet filter=lfs diff=lfs merge=lfs -text
|
|
138
|
+
*.pkl filter=lfs diff=lfs merge=lfs -text
|
|
139
|
+
data/models/** filter=lfs diff=lfs merge=lfs -text
|
|
140
|
+
```
|
|
141
|
+
|
|
142
|
+
```bash
|
|
143
|
+
git lfs install
|
|
144
|
+
git lfs track "*.parquet"
|
|
145
|
+
git add .gitattributes data/features.parquet
|
|
146
|
+
git commit -m "add feature table via LFS"
|
|
147
|
+
```
|
|
148
|
+
|
|
149
|
+
Reach for `git-lfs` when files are over ~100 MB but you don't need DVC's
|
|
150
|
+
pipeline stages or content-addressed reproducibility. Skip it if you already
|
|
151
|
+
have DVC set up — two tools versioning the same data is a recipe for
|
|
152
|
+
confusion.
|
|
153
|
+
|
|
154
|
+
LFS has real drawbacks to know about: bandwidth is metered on hosted plans,
|
|
155
|
+
`git clone` downloads the LFS objects for the checked-out commit by default (use `GIT_LFS_SKIP_SMUDGE=1`
|
|
156
|
+
to defer), and you can't selectively prune history without rewriting the
|
|
157
|
+
whole repo. For a working group of 2–5 people on a research project these
|
|
158
|
+
are usually tolerable; for a fleet of CI workers cloning on every build
|
|
159
|
+
they are not.
|
|
160
|
+
|
|
161
|
+
### What not to version
|
|
162
|
+
|
|
163
|
+
- **Third-party data with license constraints**: commit a fetch script (`scripts/fetch_kaggle.sh`) and record the SHA256 of the pulled file in a README. Re-download on each environment and check the file against the recorded hash.
|
|
164
|
+
- **Regenerable intermediates** — if `dvc repro` or `make data` can recreate it deterministically from upstream inputs, don't commit the bytes.
|
|
165
|
+
- **Scratch / exploratory outputs** — `notebooks/scratch/`, `data/tmp/`, `*.ipynb_checkpoints/` belong in `.gitignore`.
|
|
166
|
+
- **Anti-pattern: committing 500 MB Parquet files directly to git** — they live forever in history, clone times balloon, and nobody will clean it up later. Move to DVC or LFS *before* the first large commit, not after. Rewriting history to extract large blobs (`git filter-repo`, BFG) is disruptive to every collaborator and should be a last resort.
|
|
167
|
+
- **Anti-pattern: versioning model checkpoints in git** — a single PyTorch checkpoint can be several hundred MB, and training runs produce dozens. Push them to DVC or an artifact store (MLflow, Weights & Biases) keyed by experiment run ID.
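For the fetch-script approach in the first bullet, the hash check can be a few lines of standard-library Python. This is a minimal sketch; `sha256_of` and `verify_download` are hypothetical helper names, and the expected hash would come from whatever you recorded in the README.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the fetched file does not match the recorded hash."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"{path}: expected {expected_sha256}, got {actual}")
```

Calling `verify_download` right after the fetch step turns a silently changed upstream file into a loud failure instead of a quiet shift in results.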
|
|
168
|
+
|
|
169
|
+
### Quick migration path
|
|
170
|
+
|
|
171
|
+
If you're staring at a repo that has already committed large files to plain
|
|
172
|
+
git, the order of operations is:
|
|
173
|
+
|
|
174
|
+
1. Decide the target tool (DVC for most cases where you got here).
|
|
175
|
+
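2. Stop tracking the file in git with `git rm --cached <file>`, then run `dvc add`
   on it in its current location. `dvc add` refuses files that git still tracks;
   once it succeeds, it creates a `.dvc` pointer and a `.gitignore` entry.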
|
|
177
|
+
3. Commit the pointer and the updated `.gitignore`.
|
|
178
|
+
4. Optionally run `git filter-repo` to purge the old blobs from history if
|
|
179
|
+
clone size has become painful.
|
|
180
|
+
|
|
181
|
+
Step 4 requires coordination — everyone must re-clone — so defer it until
|
|
182
|
+
the pain justifies the disruption.
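Before starting the migration it helps to know which files are actually large. The sketch below scans the working tree only (it does not inspect git history); `find_large_files` is a hypothetical helper and the 100 MB default is an assumed threshold to tune per repo.

```python
from pathlib import Path

LARGE_FILE_BYTES = 100 * 1024 * 1024  # assumed threshold (~100 MB)
SKIPPED_DIRS = {".git", ".dvc", ".venv"}  # tool directories, not project data


def find_large_files(
    root: Path, min_bytes: int = LARGE_FILE_BYTES
) -> list[tuple[Path, int]]:
    """List files under `root` at or above `min_bytes`, largest first."""
    hits = []
    for path in root.rglob("*"):
        relative_parts = path.relative_to(root).parts
        if not path.is_file() or SKIPPED_DIRS.intersection(relative_parts):
            continue
        size = path.stat().st_size
        if size >= min_bytes:
            hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Running `find_large_files(Path("."))` from the repo root gives a candidate list for step 2; each entry is a file worth moving to DVC or LFS.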
|
|
183
|
+
|
|
184
|
+
### Reproducibility in practice
|
|
185
|
+
|
|
186
|
+
The goal of all of this is a single concrete question: given a git commit,
|
|
187
|
+
can a teammate rebuild the exact model artifact that the commit describes?
|
|
188
|
+
Answer yes by pinning three things together:
|
|
189
|
+
|
|
190
|
+
- **Code** — the git commit itself.
|
|
191
|
+
- **Data** — a `.dvc` pointer, an LFS object, or a committed Parquet file,
|
|
192
|
+
all content-hashed.
|
|
193
|
+
- **Environment**: a pinned `requirements.txt`, a `uv.lock` next to `pyproject.toml`, or
|
|
194
|
+
`conda-lock.yml` committed in the same commit.
|
|
195
|
+
|
|
196
|
+
If any one of those three is missing, reproducibility is accidental. The
|
|
197
|
+
versioning tool you pick is less important than treating the three as a
|
|
198
|
+
single atomic unit — changed together, reviewed together, reverted together.
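One way to make the three-part pin tangible is a small manifest written next to each model artifact. The sketch below is an assumption about how such a record might look, not a DVC or MLflow format; `build_manifest` and its fields are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(commit: str, data_hashes: dict[str, str], lockfile: Path) -> str:
    """Bundle code, data, and environment identifiers into one JSON record."""
    manifest = {
        "code": {"git_commit": commit},
        # Map of data file path to its content hash, sorted for stable output.
        "data": dict(sorted(data_hashes.items())),
        "environment": {
            "lockfile": lockfile.name,
            "sha256": hashlib.sha256(lockfile.read_bytes()).hexdigest(),
        },
    }
    return json.dumps(manifest, indent=2, sort_keys=True)
```

If any of the three sections cannot be filled in for a given artifact, that gap is the part of the pipeline where reproducibility is still accidental.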
|
|
@@ -0,0 +1,159 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-science-dev-environment
|
|
3
|
+
description: Reproducible local Python dev environment for data science using uv, direnv, pre-commit, and pyproject.toml
|
|
4
|
+
topics: [data-science, dev-environment, uv, direnv, pre-commit]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
A data-science project that cannot be recreated in minutes is a liability. Notebooks pick up stale package versions, secrets leak into `.bashrc`, and "works on my machine" kills any chance of a collaborator (or future-you) rerunning an experiment. The fix is not complicated: one lockfile, one place for env vars, one pre-commit hook, no bespoke shell scripts. This guide is opinionated toward solo and small-team workflows where local-first beats container-first.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Use `uv` as the single Python package manager: it replaces `pip`, `pip-tools`, `venv`, and `virtualenv` with one fast, reproducible tool. Declare every dependency in `pyproject.toml` so there is exactly one source of truth, and commit `uv.lock` so `uv sync` gives any collaborator the same resolved package versions. Layer `direnv` on top for per-repo environment variables (tracking URIs, data paths, secrets pulled from a vault) so nothing leaks into your global shell. Add `pre-commit` with a small set of fast hooks (`ruff-format`, `ruff-check`, end-of-file fixer) so style and obvious bugs never enter a commit. Skip Docker for greenfield solo DS work; reach for it only when you cross an OS boundary (Mac dev, Linux prod) or depend on GPU/CUDA libraries.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### uv for Python environment and dependencies
|
|
16
|
+
|
|
17
|
+
`uv` is a strong default for Python packaging as of 2025. It covers the jobs of `pip`, `venv`, and `pip-tools` in one CLI, is written in Rust, and is roughly 10-100x faster than the tools it replaces. For data science the combination of `uv sync` (reproduces the environment from the lockfile) and `uv run` (executes a script in the managed venv without activation) is the whole workflow.
|
|
18
|
+
|
|
19
|
+
Bootstrap a new project:

```bash
uv init --python 3.12 myproject
cd myproject
uv add pandas numpy scikit-learn jupyterlab pandera
uv add --dev ruff pytest pre-commit
uv sync             # makes .venv match uv.lock (uv add already synced, so a no-op here)
uv run pytest       # runs in the managed venv, no activation needed
uv run jupyter lab
```

A minimal `pyproject.toml`:

```toml
[project]
name = "myproject"
version = "0.1.0"
description = "Customer churn analysis"
requires-python = ">=3.12"
dependencies = [
    "pandas>=2.2",
    "numpy>=2.0",
    "scikit-learn>=1.5",
    "jupyterlab>=4.2",
    "pandera>=0.20",
]

[dependency-groups]
dev = [
    "ruff>=0.6",
    "pytest>=8.0",
    "pre-commit>=3.8",
]

[tool.ruff]
line-length = 100
target-version = "py312"

[tool.ruff.lint]
select = ["E", "F", "I", "B", "UP", "PD"]  # PD = pandas-vet
```

Commit both `pyproject.toml` and `uv.lock`. A collaborator clones the repo and runs `uv sync`; that is the entire setup step. No `pip install -r requirements.txt`, no virtualenv activation, no version drift.

Add a package with `uv add <name>`; remove it with `uv remove <name>`. Both edit `pyproject.toml`, refresh `uv.lock`, and sync the venv in one step. To pin a version, use `uv add "pandas==2.2.3"`. To upgrade a single package within its declared constraints, run `uv lock --upgrade-package pandas`, then `uv sync` to apply the new lock.

Two `uv` features worth knowing for data science specifically:

- **`uv run script.py`** executes a file in the project venv with no activation step. Wire this into a `Makefile` or `justfile` so `make train` and `make eval` Just Work for any collaborator.
- **Inline script metadata (PEP 723).** For one-off analysis scripts that live outside the project, a specially formatted comment block at the top of the file declares dependencies, and `uv run` creates an ephemeral environment for the script:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = ["pandas", "duckdb"]
# ///
import pandas as pd
import duckdb
...
```

Running `uv run oneoff.py` resolves and caches those deps transparently. No more "should I add this to `pyproject.toml`?" for throwaway exploration.

### direnv for env vars

`direnv` loads a per-directory `.envrc` file whenever you `cd` into the project. It keeps secrets and tracking URIs out of your global shell and ensures every terminal session sees the same variables. Skip it if your project has no environment variables; add it the first time you reach for one.

Install once (`brew install direnv`, then hook into your shell per the docs). In the project:

```bash
# .envrc (commit this file)
PATH_add .venv/bin        # put the uv-managed venv's python first on PATH
source_up_if_exists       # inherit vars from a parent .envrc if present

# Experiment tracking
export MLFLOW_TRACKING_URI="http://localhost:5000"

# Data paths (relative to repo root)
export DATA_DIR="$PWD/data"
export MODELS_DIR="$PWD/models"

# Make imports work without installing the package
# (the :+ expansion avoids a trailing colon when PYTHONPATH is unset)
export PYTHONPATH="$PWD/src${PYTHONPATH:+:$PYTHONPATH}"

# Secrets: source from a local-only file, never commit
source_env_if_exists .envrc.local
```

Add `.envrc.local` to `.gitignore` and put any actual secrets there (API keys, DB passwords). Run `direnv allow` after creating or editing `.envrc`; `direnv` refuses to load a new or changed file until you do. The moment you `cd` out of the project, the variables it set are unloaded, so your global shell stays clean.
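
On the Python side, code can then read these variables instead of hard-coding paths. The following is a minimal sketch, assuming the variable names exported in the `.envrc` above; the `Settings` class and `load_settings` helper are hypothetical names, not part of direnv or MLflow.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

REQUIRED = ("MLFLOW_TRACKING_URI", "DATA_DIR", "MODELS_DIR")


@dataclass(frozen=True)
class Settings:
    tracking_uri: str
    data_dir: Path
    models_dir: Path


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, failing fast with a helpful hint."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"missing environment variables: {', '.join(missing)}; "
            "did you run `direnv allow` in the project root?"
        )
    return Settings(
        tracking_uri=env["MLFLOW_TRACKING_URI"],
        data_dir=Path(env["DATA_DIR"]),
        models_dir=Path(env["MODELS_DIR"]),
    )


# Demo with an explicit mapping so the example runs without direnv.
demo = load_settings({
    "MLFLOW_TRACKING_URI": "http://localhost:5000",
    "DATA_DIR": "/tmp/myproject/data",
    "MODELS_DIR": "/tmp/myproject/models",
})
print(demo.data_dir / "raw")
```

Failing at import time with a pointer to `direnv allow` tends to be friendlier than a `KeyError` halfway through a training run.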

### pre-commit hooks

`pre-commit` runs a configured set of checks every time you `git commit`. Keep the hook list short and fast: anything slower than a second or two trains you to use `--no-verify`, which defeats the point. For data science the right starter set is format, lint, and a couple of sanity hooks.

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff            # lint hook id at this rev (newer releases also accept ruff-check)
        args: [--fix]
      - id: ruff-format     # runs after the linter so auto-fixes get formatted

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-yaml
      - id: check-added-large-files
        args: [--maxkb=500]  # block accidental dataset commits

  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout  # strips notebook outputs before commit
```

Install the git hook once per clone:

```bash
uv run pre-commit install
uv run pre-commit run --all-files   # bootstrap: fix everything now
```
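
To see why the `nbstripout` hook keeps notebook diffs small, the sketch below clears outputs and execution counts from a notebook's JSON using only the standard library. It is a simplified illustration of the idea, not nbstripout's implementation; the `strip_outputs` helper and the sample notebook are made up for the example.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code outputs cleared."""
    stripped = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Hypothetical two-cell notebook with one executed code cell.
SAMPLE_NB = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Churn EDA"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["df.shape"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "data": {"text/plain": ["(1000, 12)"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

clean = strip_outputs(SAMPLE_NB)
print(json.dumps(clean["cells"][1], indent=2))
```

Only the source of each cell survives, so a commit that merely re-runs a notebook produces no diff at all.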

Deliberately excluded from this set: `mypy`, `pytest`, and `bandit`. They are all valuable, but they are slow enough that they belong in CI, not in the commit path. Fast local, thorough remote.

### When to add Docker

For greenfield solo data science, Docker is overhead you do not need. `uv sync` already gives you reproducibility on any machine with the same OS, and local iteration is faster without a container layer.

Reach for Docker only when one of these is true:

- **OS mismatch between dev and prod.** You develop on macOS but the model runs on Linux in production, and a native dependency (e.g. a C extension, a specific `libgomp`) behaves differently across platforms.
- **GPU / CUDA dependencies.** CUDA toolkit versions are tightly coupled to driver versions and OS. A pinned `nvidia/cuda` base image is the only sane way to guarantee training reproducibility across machines.
- **Handoff to MLOps or serving infra.** Production deployment targets (SageMaker, Vertex, KServe, plain Kubernetes) expect a container. Build one at the handoff boundary, not before.
- **Onboarding collaborators with hostile local setups.** A Windows colleague who cannot install `uv` natively is a reasonable reason to ship a devcontainer.

When you do add Docker, keep it thin: copy `pyproject.toml` and `uv.lock`, run `uv sync --frozen`, and let the same lockfile drive both local and container builds. That way the container is a packaging detail, not a parallel source of truth.