@zigrivers/scaffold 3.21.0 → 3.23.0

Files changed (124)
  1. package/README.md +21 -7
  2. package/content/knowledge/data-science/README.md +23 -0
  3. package/content/knowledge/data-science/data-science-architecture.md +163 -0
  4. package/content/knowledge/data-science/data-science-conventions.md +233 -0
  5. package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
  6. package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
  7. package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
  8. package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
  9. package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
  10. package/content/knowledge/data-science/data-science-observability.md +161 -0
  11. package/content/knowledge/data-science/data-science-project-structure.md +178 -0
  12. package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
  13. package/content/knowledge/data-science/data-science-requirements.md +151 -0
  14. package/content/knowledge/data-science/data-science-security.md +151 -0
  15. package/content/knowledge/data-science/data-science-testing.md +183 -0
  16. package/content/knowledge/ml/README.md +10 -0
  17. package/content/methodology/data-science-overlay.yml +39 -0
  18. package/dist/cli/commands/dashboard.d.ts.map +1 -1
  19. package/dist/cli/commands/dashboard.js +40 -0
  20. package/dist/cli/commands/dashboard.js.map +1 -1
  21. package/dist/config/schema.d.ts +672 -126
  22. package/dist/config/schema.d.ts.map +1 -1
  23. package/dist/config/schema.js +8 -0
  24. package/dist/config/schema.js.map +1 -1
  25. package/dist/config/schema.test.js +2 -2
  26. package/dist/config/schema.test.js.map +1 -1
  27. package/dist/config/validators/data-science.d.ts +4 -0
  28. package/dist/config/validators/data-science.d.ts.map +1 -0
  29. package/dist/config/validators/data-science.js +15 -0
  30. package/dist/config/validators/data-science.js.map +1 -0
  31. package/dist/config/validators/index.d.ts.map +1 -1
  32. package/dist/config/validators/index.js +2 -0
  33. package/dist/config/validators/index.js.map +1 -1
  34. package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
  35. package/dist/core/assembly/knowledge-loader.js +6 -0
  36. package/dist/core/assembly/knowledge-loader.js.map +1 -1
  37. package/dist/core/assembly/knowledge-loader.test.js +34 -0
  38. package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
  39. package/dist/dashboard/dependency-graph.d.ts +19 -0
  40. package/dist/dashboard/dependency-graph.d.ts.map +1 -0
  41. package/dist/dashboard/dependency-graph.js +180 -0
  42. package/dist/dashboard/dependency-graph.js.map +1 -0
  43. package/dist/dashboard/dependency-graph.test.d.ts +2 -0
  44. package/dist/dashboard/dependency-graph.test.d.ts.map +1 -0
  45. package/dist/dashboard/dependency-graph.test.js +409 -0
  46. package/dist/dashboard/dependency-graph.test.js.map +1 -0
  47. package/dist/dashboard/generator.d.ts +46 -0
  48. package/dist/dashboard/generator.d.ts.map +1 -1
  49. package/dist/dashboard/generator.js +1 -0
  50. package/dist/dashboard/generator.js.map +1 -1
  51. package/dist/dashboard/multi-service.test.js +257 -1
  52. package/dist/dashboard/multi-service.test.js.map +1 -1
  53. package/dist/dashboard/template.d.ts +13 -0
  54. package/dist/dashboard/template.d.ts.map +1 -1
  55. package/dist/dashboard/template.js +176 -0
  56. package/dist/dashboard/template.js.map +1 -1
  57. package/dist/e2e/dashboard-cross-service-graph-wiring.test.d.ts +2 -0
  58. package/dist/e2e/dashboard-cross-service-graph-wiring.test.d.ts.map +1 -0
  59. package/dist/e2e/dashboard-cross-service-graph-wiring.test.js +130 -0
  60. package/dist/e2e/dashboard-cross-service-graph-wiring.test.js.map +1 -0
  61. package/dist/e2e/dashboard-cross-service-graph.test.d.ts +2 -0
  62. package/dist/e2e/dashboard-cross-service-graph.test.d.ts.map +1 -0
  63. package/dist/e2e/dashboard-cross-service-graph.test.js +216 -0
  64. package/dist/e2e/dashboard-cross-service-graph.test.js.map +1 -0
  65. package/dist/e2e/project-type-overlays.test.js +73 -0
  66. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  67. package/dist/project/adopt.d.ts.map +1 -1
  68. package/dist/project/adopt.js +3 -1
  69. package/dist/project/adopt.js.map +1 -1
  70. package/dist/project/detectors/coverage.test.d.ts +2 -0
  71. package/dist/project/detectors/coverage.test.d.ts.map +1 -0
  72. package/dist/project/detectors/coverage.test.js +78 -0
  73. package/dist/project/detectors/coverage.test.js.map +1 -0
  74. package/dist/project/detectors/data-science.d.ts +4 -0
  75. package/dist/project/detectors/data-science.d.ts.map +1 -0
  76. package/dist/project/detectors/data-science.js +32 -0
  77. package/dist/project/detectors/data-science.js.map +1 -0
  78. package/dist/project/detectors/data-science.test.d.ts +2 -0
  79. package/dist/project/detectors/data-science.test.d.ts.map +1 -0
  80. package/dist/project/detectors/data-science.test.js +62 -0
  81. package/dist/project/detectors/data-science.test.js.map +1 -0
  82. package/dist/project/detectors/disambiguate.d.ts +2 -0
  83. package/dist/project/detectors/disambiguate.d.ts.map +1 -1
  84. package/dist/project/detectors/disambiguate.js +3 -2
  85. package/dist/project/detectors/disambiguate.js.map +1 -1
  86. package/dist/project/detectors/disambiguate.test.js +10 -1
  87. package/dist/project/detectors/disambiguate.test.js.map +1 -1
  88. package/dist/project/detectors/index.d.ts.map +1 -1
  89. package/dist/project/detectors/index.js +2 -0
  90. package/dist/project/detectors/index.js.map +1 -1
  91. package/dist/project/detectors/library.d.ts.map +1 -1
  92. package/dist/project/detectors/library.js +1 -0
  93. package/dist/project/detectors/library.js.map +1 -1
  94. package/dist/project/detectors/resolve-detection.test.js +31 -0
  95. package/dist/project/detectors/resolve-detection.test.js.map +1 -1
  96. package/dist/project/detectors/types.d.ts +6 -2
  97. package/dist/project/detectors/types.d.ts.map +1 -1
  98. package/dist/project/detectors/types.js.map +1 -1
  99. package/dist/types/config.d.ts +8 -1
  100. package/dist/types/config.d.ts.map +1 -1
  101. package/dist/wizard/copy/core.d.ts.map +1 -1
  102. package/dist/wizard/copy/core.js +4 -0
  103. package/dist/wizard/copy/core.js.map +1 -1
  104. package/dist/wizard/copy/data-science.d.ts +3 -0
  105. package/dist/wizard/copy/data-science.d.ts.map +1 -0
  106. package/dist/wizard/copy/data-science.js +15 -0
  107. package/dist/wizard/copy/data-science.js.map +1 -0
  108. package/dist/wizard/copy/index.d.ts.map +1 -1
  109. package/dist/wizard/copy/index.js +2 -0
  110. package/dist/wizard/copy/index.js.map +1 -1
  111. package/dist/wizard/copy/types.d.ts +5 -1
  112. package/dist/wizard/copy/types.d.ts.map +1 -1
  113. package/dist/wizard/copy/types.test-d.js +7 -0
  114. package/dist/wizard/copy/types.test-d.js.map +1 -1
  115. package/dist/wizard/questions.d.ts +2 -1
  116. package/dist/wizard/questions.d.ts.map +1 -1
  117. package/dist/wizard/questions.js +9 -1
  118. package/dist/wizard/questions.js.map +1 -1
  119. package/dist/wizard/questions.test.js +14 -0
  120. package/dist/wizard/questions.test.js.map +1 -1
  121. package/dist/wizard/wizard.d.ts.map +1 -1
  122. package/dist/wizard/wizard.js +1 -0
  123. package/dist/wizard/wizard.js.map +1 -1
  124. package/package.json +1 -1
@@ -0,0 +1,178 @@
1
+ ---
2
+ name: data-science-project-structure
3
+ description: Opinionated directory layout for solo and small-team data-science projects — notebooks, src, data, models, reports, tests, configs — with a promotion path from exploration to tested modules
4
+ topics: [data-science, project-structure, layout]
5
+ ---
6
+
7
+ A solo data-science project accumulates artifacts faster than most software: half-finished notebooks, CSV dumps, parquet caches, serialized models, PNG charts, and the occasional markdown write-up. Without a deliberate directory structure, the project turns into a folder of 40 loose files within a month and a new contributor — including future-you — cannot tell what is canonical, what is scratch, and what is safe to delete. A clear layout fixes three problems at once: discoverability (where does X live?), git hygiene (what is tracked vs generated?), and the promotion path (how does throwaway notebook code become tested library code?).
8
+
9
+ ## Summary
10
+
11
+ A solo DS project has seven top-level directories that each answer one question: `notebooks/` (exploration), `src/` (importable Python modules), `data/` (split into raw/interim/processed; `data/raw/` is always gitignored, while small processed artifacts may be committed or DVC-tracked), `models/` (serialized artifacts, tracked via DVC or git-lfs), `reports/` (rendered outputs: figures, HTML, markdown), `tests/` (pytest suite mirroring `src/`), and `configs/` (YAML run parameters). `pyproject.toml` at the root defines the package. The `.gitignore` excludes raw data, most of `models/`, and common binary formats that were not deliberately promoted. Reusable logic follows a strict promotion path: explored in a notebook, extracted into `src/`, unit-tested in `tests/`, then re-imported by notebooks or pipeline scripts.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Top-level layout
16
+
17
+ ```
18
+ project-root/
19
+ ├── notebooks/ # Exploratory notebooks (Marimo preferred; numbered chronologically)
20
+ ├── src/ # Importable Python modules — the library
21
+ │   └── <project>/
22
+ │       ├── __init__.py
23
+ │       ├── ingestion.py   # Load raw data from source (CSV, DB, API)
24
+ │       ├── features.py    # Feature engineering / transforms
25
+ │       ├── training.py    # Model fitting routines
26
+ │       ├── evaluation.py  # Metrics, CV loops, slice analysis
27
+ │       └── serving.py     # Inference helpers (load artifact, predict)
28
+ ├── data/ # Datasets at every pipeline stage
29
+ │   ├── raw/        # Immutable inputs: GITIGNORED (always)
30
+ │   ├── interim/    # Cached intermediates: small Parquet may be committed
31
+ │   └── processed/  # Analysis-ready: usually DVC-tracked; small files may be committed
32
+ ├── models/ # Serialized model artifacts (DVC / git-lfs tracked)
33
+ ├── reports/ # Rendered output: figures/, HTML reports, markdown summaries
34
+ │   └── figures/
35
+ ├── tests/ # pytest suite — mirrors src/ structure
36
+ ├── configs/ # YAML run configs (Hydra-style or plain)
37
+ ├── pyproject.toml # Package metadata, dependencies, tool config
38
+ ├── .gitignore
39
+ └── README.md
40
+ ```
41
+
42
+ One-liners per dir:
43
+ - `notebooks/` — exploration, EDA, prototyping; numbered `01-…`, `02-…` so ordering is obvious
44
+ - `src/` — every reusable function that a second notebook or a pipeline script will call
45
+ - `data/` — all datasets at every stage; raw is always gitignored, selected processed artifacts (small Parquet in `data/interim/` or `data/processed/`) may be committed directly or tracked via DVC — see `data-science-data-versioning`
46
+ - `models/` — trained model artifacts; tracked through DVC or git-lfs pointers, never raw binaries
47
+ - `reports/` — things a human reads: charts, HTML reports, markdown summaries
48
+ - `tests/` — pytest tests for code in `src/`
49
+ - `configs/` — experiment parameters (paths, seeds, hyperparams) separate from code
50
+
51
+ ### Data: gitignore raw, deliberately admit small processed artifacts
52
+
53
+ The strictest rule in DS project hygiene: **never commit raw datasets under `data/raw/` or raw model binaries under `models/` to git**. A 200 MB Parquet file committed to history stays in every clone. Removing it means rewriting history with a tool such as `git filter-repo`, which changes the hash of every commit from that point forward. Prevent the problem at the `.gitignore` layer before it happens.
54
+
55
+ Gitignoring the entire `data/` tree is the safest default, but it under-serves a common small-team workflow: a cleaned, analysis-ready Parquet in `data/interim/` that's <10 MB, changes rarely, and is useful to have alongside the code. See `data-science-data-versioning` for the full size-based decision rule. The pattern below gitignores raw data and external copies wholesale, and allows opt-in commits of small processed Parquet through a deliberate un-ignore rule. Anything larger (>50 MB, frequent churn, binary artifacts) goes through DVC or git-lfs instead — never direct git commits.
56
+
57
+ ```gitignore
58
+ # Raw / external data — never committed (bulky, usually not redistributable)
59
+ data/raw/
60
+ data/external/
61
+
62
+ # Processed / interim data — default: ignore; opt in to specific small artifacts below
63
+ data/interim/*
64
+ data/processed/*
65
+ !data/.gitkeep
66
+ !data/interim/.gitkeep
67
+ !data/processed/.gitkeep
68
+ # Allow small cleaned Parquet to be committed (see data-science-data-versioning
69
+ # for size guidance — under ~10 MB, rare changes). Larger artifacts belong in
70
+ # DVC or git-lfs.
71
+ !data/interim/*.parquet
72
+ !data/processed/*.parquet
73
+
74
+ # Model artifacts — tracked via DVC or git-lfs, not raw binaries
75
+ models/**
76
+ !models/**/
+ !models/.gitkeep
77
+ !models/**/*.dvc
78
+
79
+ # Common large binary formats (defense in depth — catch anything dropped elsewhere)
80
+ *.feather
81
+ *.joblib
82
+ *.pt
83
+ *.pth
84
+ *.onnx
85
+ *.h5
86
+ *.hdf5
87
+ *.npy
88
+ *.npz
89
+
90
+ # Python
91
+ __pycache__/
92
+ *.pyc
93
+ .venv/
94
+ .ruff_cache/
95
+ .pytest_cache/
96
+ *.egg-info/
97
+
98
+ # Notebook outputs (if not using a tool that strips them)
99
+ .ipynb_checkpoints/
100
+
101
+ # Environment / secrets
102
+ .env
103
+ .env.*
104
+ !.env.example
105
+ ```
106
+
107
+ Two things are load-bearing in this snippet. First, `*.parquet` is **not** in the blanket block-list — we want `data/interim/*.parquet` to match as "allowed" once the un-ignore rules kick in. Second, the `!data/interim/*.parquet` and `!data/processed/*.parquet` patterns mean processed Parquet is committable **by default** at this layer; the policy choice of whether to actually commit a given file is made at `git add` time, not in `.gitignore`. If your team's policy is DVC-first for every dataset, drop those `!…*.parquet` lines. The `!data/.gitkeep` family keeps the directories present in fresh clones.
108
+
109
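+ For versioned datasets and models, see `data-science-data-versioning`: DVC or git-lfs pointers are committed, and the binaries themselves live in remote storage. Be careful with model formats. Unpickling executes arbitrary code, so a model file from an untrusted source becomes an RCE vector. That applies to stdlib `pickle` and also to `joblib` files and classic PyTorch checkpoints (`.pt`, `.pth`), which are pickle-based. For artifacts that cross a trust boundary, prefer a format that stores no executable code, such as ONNX, and only load pickle-based artifacts you produced yourself.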
110
+
111
+ ### Notebooks → src/ promotion
112
+
113
+ Notebooks are for exploration, not production. The moment a function in a notebook becomes useful to a second notebook — or looks like it will survive longer than the current sitting — it gets promoted:
114
+
115
+ 1. **Identify**: a cell (or few cells) encapsulating reusable logic — a loader, a transform, a metric computation
116
+ 2. **Extract**: move the function into the appropriate `src/<project>/` module (`ingestion.py`, `features.py`, etc.) with type hints and a docstring
117
+ 3. **Test**: add a pytest case in `tests/` that exercises a representative input → output case
118
+ 4. **Re-import**: the notebook now does `from <project>.features import clean_customer_ids` instead of defining the function inline
119
+
120
+ This discipline keeps notebooks short (exploration, narrative, charts) and concentrates correctness-critical code where it can be reviewed, tested, and reused. See `data-science-notebook-discipline` for the mechanics of cell size, output clearing, and `%autoreload` so edits in `src/` are picked up in the notebook without a kernel restart.
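As a rough illustration of the four promotion steps, the sketch below shows what a promoted function and its test might look like. The function name `clean_customer_ids` and the test name come from the text; the behaviour (strip whitespace, drop empty IDs) is an assumption made for the example, not something the package defines.

```python
from typing import Iterable


# Hypothetical contents of src/<project>/features.py after promotion.
def clean_customer_ids(ids: Iterable[str]) -> list[str]:
    """Strip surrounding whitespace and drop IDs that end up empty (assumed behaviour)."""
    stripped = (raw.strip() for raw in ids)
    return [customer_id for customer_id in stripped if customer_id]


# Hypothetical contents of tests/test_features.py.
def test_clean_customer_ids_strips_whitespace() -> None:
    assert clean_customer_ids([" a1 ", "b2\n", "   "]) == ["a1", "b2"]
```

The notebook cell then shrinks to an import plus a call, and the behaviour is pinned by a test that runs without the notebook.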
121
+
122
+ ### Configs and reproducibility
123
+
124
+ Hard-coded paths and hyperparameters inside notebook cells are the single biggest reproducibility killer in a DS project. Push them into `configs/` so a run is defined by a config file + a git SHA.
125
+
126
+ ```yaml
127
+ # configs/train_baseline.yaml
128
+ run_name: baseline_v1
129
+ seed: 42
130
+
131
+ data:
132
+   raw_path: data/raw/transactions_2024.csv
133
+   processed_path: data/processed/transactions_clean.parquet
134
+   target: churned_30d
135
+   test_size: 0.2
136
+   split_seed: 42
137
+
138
+ features:
139
+   include:
140
+     - tenure_days
141
+     - monthly_spend
142
+     - support_tickets_30d
143
+   log_transform:
144
+     - monthly_spend
145
+
146
+ model:
147
+   type: gradient_boosting
148
+   params:
149
+     n_estimators: 200
150
+     max_depth: 5
151
+     learning_rate: 0.05
152
+
153
+ output:
154
+   model_path: models/baseline_v1.joblib
155
+   report_path: reports/baseline_v1.html
156
+ ```
157
+
158
+ Training code reads the config with `yaml.safe_load` (or Hydra / pydantic-settings for richer projects) and a teammate can reproduce the run with `python -m <project>.training --config configs/train_baseline.yaml`. For Hydra specifically, configs split into `configs/data/`, `configs/model/`, `configs/training/` and compose at the command line.
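One way to keep config handling honest is to convert the loaded mapping into a typed object at startup. The sketch below assumes the dict has already been produced by `yaml.safe_load`; the `RunConfig` class and its field selection are hypothetical and only cover a few of the keys from the example config.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class RunConfig:
    """Typed view of the fields a training run needs (hypothetical helper)."""

    run_name: str
    seed: int
    raw_path: str
    target: str
    test_size: float
    model_path: str

    @classmethod
    def from_dict(cls, cfg: Mapping[str, Any]) -> "RunConfig":
        # Indexing with [] raises KeyError on a missing key, so a typo in the
        # YAML fails at startup instead of halfway through training.
        test_size = float(cfg["data"]["test_size"])
        if not 0.0 < test_size < 1.0:
            raise ValueError(f"data.test_size must be in (0, 1), got {test_size}")
        return cls(
            run_name=str(cfg["run_name"]),
            seed=int(cfg["seed"]),
            raw_path=str(cfg["data"]["raw_path"]),
            target=str(cfg["data"]["target"]),
            test_size=test_size,
            model_path=str(cfg["output"]["model_path"]),
        )


# In a real run this mapping would come from yaml.safe_load on the config file.
example = {
    "run_name": "baseline_v1",
    "seed": 42,
    "data": {
        "raw_path": "data/raw/transactions_2024.csv",
        "target": "churned_30d",
        "test_size": 0.2,
    },
    "output": {"model_path": "models/baseline_v1.joblib"},
}
config = RunConfig.from_dict(example)
```

Freezing the dataclass means the run parameters cannot be mutated by accident after they are loaded, which keeps the "config file + git SHA" description of a run accurate.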
159
+
160
+ ### Tests layout
161
+
162
+ `tests/` mirrors `src/` one-to-one. If `src/<project>/features.py` defines `clean_customer_ids`, then `tests/test_features.py` contains `test_clean_customer_ids_strips_whitespace` and friends.
163
+
164
+ ```
165
+ tests/
166
+ ├── conftest.py # Shared fixtures (tiny sample dataframes, tmp_path helpers)
167
+ ├── test_ingestion.py # Tests for src/<project>/ingestion.py
168
+ ├── test_features.py # Tests for src/<project>/features.py
169
+ ├── test_training.py # Tests for src/<project>/training.py — usually smoke tests
170
+ └── test_evaluation.py # Tests for src/<project>/evaluation.py
171
+ ```
172
+
173
+ Naming rules:
174
+ - Test files: `test_<module>.py` — pytest discovers these by default
175
+ - Test functions: `test_<unit>_<behavior>` — e.g. `test_clean_customer_ids_strips_whitespace`, `test_load_transactions_raises_on_missing_file`
176
+ - Fixtures live in `conftest.py` at the `tests/` root when shared across files; local fixtures stay in the file that uses them
177
+
178
+ Training and evaluation tests are typically **smoke tests** over a 10-row fixture dataframe, not full-dataset runs — the goal is catching shape/dtype/column regressions, not validating model quality (model quality belongs in the evaluation report, not the unit test suite).
@@ -0,0 +1,164 @@
1
+ ---
2
+ name: data-science-reproducibility
3
+ description: Reproducibility for solo/small-team DS — pin deps with uv lock, seed everything, set PYTHONHASHSEED, and reach for Docker only at OS boundaries
4
+ topics: [data-science, reproducibility, determinism, uv, docker]
5
+ ---
6
+
7
+ You show a result in Monday's meeting. Six months later, on a new laptop, you can't reproduce it. Three things usually cause this: dependencies drifted (a minor NumPy release changed a default), randomness wasn't pinned (a shuffle or init picked a different seed), or the data changed underneath you. Reproducibility is the discipline of eliminating all three so the same inputs always produce the same numbers.
8
+
9
+ ## Summary
10
+
11
+ Pin dependencies with `uv lock` and commit `uv.lock` — `uv sync --frozen` rebuilds the exact environment anywhere. Control randomness with a single `set_seed(seed)` helper that seeds Python `random`, NumPy, PyTorch, and TensorFlow at the top of every script. Export `PYTHONHASHSEED=0` via `.envrc` so hash-order is deterministic across interpreter runs. Log the git SHA and data hash with every run so you can walk back to the exact code + data that produced any number. Reach for Docker only when you're crossing an OS or CUDA boundary — for greenfield solo work, `uv sync` is enough.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Pinning dependencies with uv
16
+
17
+ `uv` resolves the full transitive dependency graph into `uv.lock`, which records the exact version and content hash of every package, including transitive deps you never directly imported. Commit it. On a new machine, `uv sync --frozen` reproduces the environment byte-for-byte without re-resolving anything.
18
+
19
+ ```bash
20
+ # First time: declare top-level deps in pyproject.toml, then lock
21
+ uv lock
22
+
23
+ # On any machine (CI, teammate's laptop, 6 months later):
24
+ uv sync --frozen # install exactly what's in uv.lock, never re-resolve
25
+
26
+ # Upgrade a single package intentionally:
27
+ uv lock --upgrade-package numpy
28
+ # Review the lock diff in PR. Re-run your eval suite before merging.
29
+
30
+ # Add a new dependency:
31
+ uv add pandas # updates pyproject.toml AND uv.lock atomically
32
+ ```
33
+
34
+ Rules:
35
+ - Commit `uv.lock`. It is not a build artifact; it is a reproducibility contract.
36
+ - Use `--frozen` in CI and release scripts. A silent re-resolve on deploy is the bug you're trying to prevent.
37
+ - Upgrade packages one at a time, with a PR and an eval run. Bulk upgrades hide which bump broke your metrics.
38
+ - Pin the Python version too: add `requires-python = "==3.12.*"` in `pyproject.toml` and let uv install and manage the interpreter. Minor Python versions change float formatting, dict ordering guarantees, and stdlib behavior in ways that can move your numbers.
39
+
40
+ ### Seed management
41
+
42
+ Every source of randomness in your stack has its own PRNG. Seed all of them from a single call, at the top of every train/eval/predict entry point.
43
+
44
+ ```python
45
+ # src/utils/seed.py
46
+ import os
47
+ import random
48
+ import numpy as np
49
+
50
+ def set_seed(seed: int = 42) -> None:
51
+     """Seed every PRNG we might touch. Call at the top of every script."""
52
+     os.environ["PYTHONHASHSEED"] = str(seed)  # inherited by child processes only; this interpreter's hash seed was fixed at startup
53
+     random.seed(seed)
54
+     np.random.seed(seed)
55
+
56
+     try:
57
+         import torch
58
+         torch.manual_seed(seed)
59
+         if torch.cuda.is_available():
60
+             torch.cuda.manual_seed_all(seed)
61
+     except ImportError:
62
+         pass
63
+
64
+     try:
65
+         import tensorflow as tf
66
+         tf.random.set_seed(seed)
67
+     except ImportError:
68
+         pass
69
+ ```
70
+
71
+ Call `set_seed(42)` before any data split, model init, or sampling. If a library accepts a `random_state` argument (scikit-learn does almost everywhere), pass the seed explicitly — global seeding is a safety net, not a substitute.
72
+
73
+ ```python
74
+ # Explicit is better than implicit:
75
+ from sklearn.model_selection import train_test_split
76
+ from sklearn.ensemble import RandomForestClassifier
77
+
78
+ set_seed(42) # global safety net
79
+
80
+ X_train, X_test, y_train, y_test = train_test_split(
81
+     X, y, test_size=0.2, random_state=42  # explicit
82
+ )
83
+ model = RandomForestClassifier(random_state=42) # explicit
84
+ ```
85
+
86
+ The one gotcha: multi-worker DataLoaders in PyTorch spawn subprocesses that need their own seeding. Pass `worker_init_fn` to seed each worker, or you'll get different augmentation sequences across runs even with `set_seed` called in the main process.
87
+
88
+ ### Hash determinism
89
+
90
+ Python randomizes the hash seed for `str` and `bytes` objects on every interpreter start by default. That means set iteration order, and anything else that depends on `hash()` of a string, can vary between runs (dict iteration order is insertion order and is not affected). It is a subtle reproducibility leak that only shows up when you try to diff two training runs.
91
+
92
+ ```bash
93
+ # .envrc (direnv)
94
+ export PYTHONHASHSEED=0
95
+ ```
96
+
97
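+ `set_seed()` also writes this variable, but an assignment made inside a running interpreter only reaches subprocesses it spawns later; the interpreter's own hash seed is fixed at startup. Exporting it in `.envrc` covers everything in the shell session (notebooks, ad-hoc scripts, the test runner) before any Python code runs.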
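To see the effect of the variable directly, the sketch below starts two fresh interpreters with the same `PYTHONHASHSEED` and compares the hash of one string. The helper name and the sample string are illustrative choices, not part of the package.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(hash_seed: str) -> str:
    """Return hash("churned_30d") as computed by a new interpreter started with the given PYTHONHASHSEED."""
    # The variable must be in the environment before the interpreter starts,
    # which is why it is passed via env= rather than set inside the child.
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", 'print(hash("churned_30d"))'],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


first = string_hash_in_fresh_interpreter("0")
second = string_hash_in_fresh_interpreter("0")
reproducible = first == second  # same seed, same hash across interpreter runs
```

Running the same comparison with the variable unset (or set to `random`) will usually give two different values, which is the leak the `.envrc` export closes.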
98
+
99
+ ### GPU determinism (brief)
100
+
101
+ Full GPU determinism requires cuDNN-level flags and disabling non-deterministic kernels:
102
+
103
+ ```python
104
+ # Only if you actually need this:
105
+ torch.backends.cudnn.deterministic = True
106
+ torch.backends.cudnn.benchmark = False
107
+ ```
108
+
109
+ This has a real performance cost (often 10-30% slower training) and doesn't cover every op. For DS-1, don't chase it. CPU-level determinism from `set_seed()` + pinned deps is enough for 95% of analyses. Reach for GPU determinism only under regulatory requirement, scientific publication, or when debugging a numerics bug that you can't otherwise isolate.
110
+
111
+ ### Git SHA and data versioning
112
+
113
+ A reproducible run needs four things pinned: code, dependencies, randomness, and data. The sections above cover dependencies and randomness. For code, log the git SHA with every experiment (see `data-science-experiment-tracking.md` for the logging pattern; don't duplicate the plumbing here). For data, hash the input dataset or pin a DVC / lakeFS / Git-LFS reference (see `data-science-data-versioning.md`).
114
+
115
+ The minimum metadata for any reported result:
116
+
117
+ ```text
118
+ git_sha: a1b2c3d4
119
+ uv_lock: sha256:... # hash of uv.lock
120
+ seed: 42
121
+ data_hash: sha256:... # hash of the input dataset(s)
122
+ python: 3.12.1
123
+ platform: darwin-arm64
124
+ ```
125
+
126
+ If all six fields match, the numbers should match. If any differ, you know exactly which knob moved.
127
+
128
+ A working pattern: log these fields into your experiment tracker alongside metrics, and include them in any reported result (paper, slide, dashboard tile). The friction cost is near zero once automated; the debugging cost of a result you can't trace back to its exact code + data is enormous.
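A minimal sketch of automating that metadata with the standard library only is shown below. The function names are hypothetical, and the output keys simply mirror the metadata fields listed above; how the dict reaches your experiment tracker is left out.

```python
import hashlib
import platform
import subprocess
import sys
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def current_git_sha() -> str:
    """Return the short commit SHA, or "unknown" when git or a checkout is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def run_metadata(seed: int, lock_path: Path, data_path: Path) -> dict[str, str]:
    """Collect the reproducibility fields for one run."""
    return {
        "git_sha": current_git_sha(),
        "uv_lock": sha256_of_file(lock_path),
        "seed": str(seed),
        "data_hash": sha256_of_file(data_path),
        "python": platform.python_version(),
        "platform": f"{sys.platform}-{platform.machine()}",
    }
```

A typical call would be `run_metadata(42, Path("uv.lock"), Path("data/processed/transactions_clean.parquet"))`, logged once at the start of the run.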
129
+
130
+ ### Docker: only at OS boundaries
131
+
132
+ Docker solves a real problem: "it works on my Mac but not on the Linux GPU box." It does not solve "I forgot to commit `uv.lock`." Reach for containers when you're genuinely crossing a boundary:
133
+
134
+ - Developing on macOS, deploying on Linux — native wheels differ, BLAS differs, occasionally results differ.
135
+ - CUDA version mismatch between dev and prod GPUs.
136
+ - A team standardizing a shared prod environment where `uv sync` isn't enough because the base OS libs drift.
137
+
138
+ For a solo greenfield project on one laptop, a Dockerfile is pure overhead. Start with `uv sync --frozen` and add Docker the first time you actually hit a cross-OS reproducibility failure — not before.
139
+
140
+ When you do reach for it, keep the image minimal and derived from your lockfile:
141
+
142
+ ```dockerfile
143
+ FROM python:3.12-slim
144
+ COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
145
+ WORKDIR /app
146
+ COPY pyproject.toml uv.lock ./
147
+ RUN uv sync --frozen --no-dev
148
+ COPY src/ ./src/
149
+ ENV PYTHONHASHSEED=0
150
+ CMD ["uv", "run", "python", "-m", "src.train"]
151
+ ```
152
+
153
+ Pin the base image by digest (`python:3.12-slim@sha256:...`) once the project is in prod — floating tags drift and will silently give you a different glibc next month.
154
+
155
+ ### Reproducibility checklist
156
+
157
+ Before calling any analysis "done":
158
+
159
+ - `uv.lock` committed and current (`uv sync --frozen` in CI succeeds)
160
+ - `set_seed()` called at the top of every entry point
161
+ - `PYTHONHASHSEED=0` in `.envrc` (and `.envrc` committed, `.env` gitignored)
162
+ - Git SHA + data hash logged with every experiment run
163
+ - Eval suite passes on a clean clone in CI — the real test of reproducibility is a fresh machine, not your own
164
+
@@ -0,0 +1,151 @@
1
+ ---
2
+ name: data-science-requirements
3
+ description: Problem framing, success metrics, evaluation-test design, stakeholder contracts, and nonfunctional requirements for solo/small-team data science projects
4
+ topics: [data-science, requirements, evaluation, success-metrics, reproducibility]
5
+ ---
6
+
7
+ As a solo or small-team data scientist without an existing data platform, the single biggest risk to your project is not a bad model — it is ambiguous requirements. Without a tight written spec, a DS project sprawls: the question drifts week to week, the notebook becomes unreproducible, and the stakeholder quietly reinterprets the output. This document defines what "done" looks like for an analytical pipeline, model, or report built from scratch — so you can stop work on time and defend the result.
8
+
9
+ ## Summary
10
+
11
+ A data-science requirements doc states a single well-framed question, one primary success metric with a numeric acceptance threshold declared before any modeling, an evaluation design using held-out data, a stakeholder contract (who consumes the output, in what format, on what cadence), and a nonfunctional budget (reproducibility, runtime, storage). Write the target threshold into a test before you touch training data. If you cannot name the metric and the number, you are not ready to start.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Problem framing
16
+
17
+ Most DS projects fail at step one: the question is fuzzy ("understand churn") rather than decidable ("predict 30-day churn for active paying users, with recall >= 0.6 at precision >= 0.3"). The discipline is to force yourself, in writing, to name the decision the output will drive. If you cannot name that decision, stop and interview the stakeholder until you can.
18
+
19
+ Use a short, copyable problem-statement block at the top of your project README or PRD. The one below is opinionated — it forces every ambiguous field to get filled in before modeling starts. The tradeoff: for pure exploratory work (e.g. a one-off investigation) this is overkill; a 3-line hypothesis is enough.
20
+
21
+ ```yaml
22
+ # docs/problem-statement.yaml
23
+ question: >
24
+   For monthly paying users active in the last 30 days, predict whether they
25
+   will cancel their subscription within the next 30 days.
26
+ decision_driven:
27
+   who: Growth team
28
+   action: Enroll top-decile predicted churners in a retention email campaign
29
+   cadence: Weekly scoring
30
+ unit_of_analysis: user_id x scoring_date
31
+ prediction_target: churn_within_30d (bool)
32
+ out_of_scope:
33
+   - free-tier users
34
+   - annual subscribers
35
+   - users less than 14 days old at scoring time
36
+ known_confounders:
37
+   - planned price change on 2026-05-01
38
+   - seasonality around end-of-year
39
+ ```
40
+
41
+ ### Success metrics
42
+
43
+ State the primary success metric and its acceptance threshold in writing before you train anything. The number comes from the stakeholder contract, not from what the model can achieve — otherwise you are reverse-engineering the bar to whatever you got. Pick one primary metric; secondary metrics are tie-breakers, not co-equals.
44
+
45
+ Typical patterns:
46
+
47
+ - **Predictive model**: one primary metric tied to the downstream decision. For a ranked retention campaign, `recall@top-10%` or `precision@k` beats accuracy or raw AUC, because the campaign can only email the top decile.
48
+ - **Regression / forecast**: `RMSE` in the target's natural unit, plus a naive baseline (last-value, rolling-mean). Beating the baseline is mandatory; if you cannot, the project is not viable.
49
+ - **Analytical pipeline / ETL**: functional correctness plus a p95 runtime budget (e.g. "daily job must finish in < 20 min on the scheduled box").
50
+ - **Report / dashboard**: domain acceptance threshold — the numbers in the report must match an independently computed source-of-truth query within a stated tolerance (e.g. "<= 0.1% deviation from the finance ledger").
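For the regression / forecast pattern, the baseline comparison can be written down in a few lines. This is a sketch under the assumption of a single univariate series and one-step-ahead predictions; the function names are illustrative.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error in the target's natural unit."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and the same length")
    squared_errors = ((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sum(squared_errors) / len(actual))


def last_value_forecast(series: Sequence[float]) -> list[float]:
    """Naive baseline: predict each point as the previous observed value."""
    return list(series[:-1])


def beats_baseline(series: Sequence[float], model_predictions: Sequence[float]) -> bool:
    """True when the model's RMSE on series[1:] is strictly below the last-value baseline."""
    actual = list(series[1:])
    baseline_rmse = rmse(actual, last_value_forecast(series))
    model_rmse = rmse(actual, model_predictions)
    return model_rmse < baseline_rmse
```

The strict inequality is a deliberate choice here: a model that only ties the naive baseline adds complexity without adding value.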
51
+
52
+ Encode the success metric as a function so it is unambiguous and testable. The expression below is the whole contract — write it the day you start.
53
+
54
+ ```python
55
+ # src/metrics.py
56
+ from sklearn.metrics import precision_recall_curve
57
+ import numpy as np
58
+
59
+ TARGET_RECALL = 0.60
60
+ MIN_PRECISION = 0.30 # at the threshold that achieves TARGET_RECALL
61
+
62
+ def primary_metric(y_true: np.ndarray, y_score: np.ndarray) -> dict:
63
+     """Primary success metric: precision at the threshold that hits target recall."""
64
+     precision, recall, thresholds = precision_recall_curve(y_true, y_score)
65
+     # Walk from highest threshold down; stop when recall crosses target.
66
+     idx = np.searchsorted(recall[::-1], TARGET_RECALL)
67
+     idx = len(recall) - 1 - idx
68
+     return {
69
+         "recall": float(recall[idx]),
70
+         "precision": float(precision[idx]),
71
+         "threshold": float(thresholds[min(idx, len(thresholds) - 1)]),
72
+         "passes": bool(recall[idx] >= TARGET_RECALL and precision[idx] >= MIN_PRECISION),
73
+     }
74
+ ```
75
+
76
+ ### Evaluation-test design
77
+
78
+ The evaluation test is the single gate between "training run" and "ship it." Its job is to answer one question: does the model hit the stated metric on data it has not seen? Get this wrong — leak the future into the past, evaluate on training rows — and every downstream decision is poisoned.
79
+
80
+ Opinionated defaults:
81
+
82
+ - **Temporal target**: split by time, not randomly. Train on `[t0, t1)`, hold out `[t1, t2)`. Random splits with temporal data leak future information and will silently inflate metrics.
83
+ - **Non-temporal target**: stratified split by the label, fixed `random_state`, held-out fraction 15-20%.
84
+ - **Small data (< 10k rows)**: 5-fold cross-validation with the same fold seed every run; report mean plus std of the primary metric.
85
+ - **Never** tune hyperparameters on the holdout. Use a third validation split or inner CV. Tradeoff: if your dataset is tiny you may have to pool — document the risk explicitly.
86
+
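The temporal default above can be sketched as a small helper. This is an illustration only, assuming a pandas DataFrame with one timestamp column; `temporal_split` and its parameter names are hypothetical.

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, time_col: str, t0, t1, t2):
    """Train on rows in [t0, t1); hold out rows in [t1, t2).

    Both intervals are half-open, so no row can land in both sets and
    every holdout row is strictly later than every training row.
    """
    ts = pd.to_datetime(df[time_col])
    t0, t1, t2 = pd.Timestamp(t0), pd.Timestamp(t1), pd.Timestamp(t2)
    train = df[(ts >= t0) & (ts < t1)]
    holdout = df[(ts >= t1) & (ts < t2)]
    return train, holdout
```

Usage would look like `train, holdout = temporal_split(df, "event_date", "2025-01-01", "2026-01-01", "2026-04-01")`, where the column name and dates are placeholders.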
87
+ The evaluation belongs in the test suite, not a notebook. The stakeholder should be able to run `pytest tests/test_model_evaluation.py` and see green before accepting the deliverable.
88
+
89
+ ```python
+ # tests/test_model_evaluation.py
+ import joblib
+ import numpy as np
+ import pandas as pd
+ import pytest
+ from src.metrics import primary_metric, TARGET_RECALL, MIN_PRECISION
+
+ HOLDOUT_PATH = "data/holdout_2026_q1.parquet"
+ MODEL_PATH = "artifacts/churn_model.pkl"
+
+ @pytest.fixture(scope="module")
+ def scored_holdout():
+     df = pd.read_parquet(HOLDOUT_PATH)
+     model = joblib.load(MODEL_PATH)
+     X = df.drop(columns=["churn_within_30d"])
+     y_true = df["churn_within_30d"].to_numpy()
+     y_score = model.predict_proba(X)[:, 1]
+     return y_true, y_score
+
+ def test_model_beats_acceptance_threshold(scored_holdout):
+     y_true, y_score = scored_holdout
+     result = primary_metric(y_true, y_score)
+     assert result["passes"], (
+         f"Model failed acceptance: recall={result['recall']:.3f} "
+         f"(target {TARGET_RECALL}), precision={result['precision']:.3f} "
+         f"(min {MIN_PRECISION})"
+     )
+
+ def test_model_beats_naive_baseline(scored_holdout):
+     # Baseline: predict global churn rate for everyone. Any real model must beat it.
+     y_true, y_score = scored_holdout
+     baseline_score = np.full(len(y_true), y_true.mean())
+     assert primary_metric(y_true, y_score)["precision"] > \
+         primary_metric(y_true, baseline_score)["precision"]
+ ```
124
+
125
+ ### Stakeholder contract
126
+
127
+ A stakeholder contract makes the hand-off concrete. Without it, you deliver a notebook and the recipient quietly asks for a PDF, a Slack message, a dashboard, or a CSV — all different artifacts. Write this down the same week you write the problem statement.
128
+
129
+ Minimum fields, in order of how often they get skipped:
130
+
131
+ - **Consumer**: named human or team, not "the business."
132
+ - **Artifact format**: one of `csv`, `parquet`, `dashboard (URL)`, `API endpoint`, `PDF report`, `Slack summary`. Pick exactly one primary.
133
+ - **Schema**: column names, types, units, PII flags. Include an example row.
134
+ - **Cadence**: one-shot, daily, weekly, on-demand. If recurring, name the day-of-week and time-of-day.
135
+ - **Freshness SLA**: how stale is the underlying data allowed to be at delivery time.
136
+ - **Failure behavior**: what happens if the pipeline fails — silent retry, page the owner, stale-serve, fail loud.
137
+ - **Sunset criteria**: when does this deliverable stop being needed. If you cannot answer, the project has no natural end.
138
+
139
+ A one-off analysis can collapse this into a single paragraph; a recurring pipeline needs all seven fields in a short `CONTRACT.md` alongside the code.
140
+
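One way to keep the seven fields from being skipped is to make them a typed record that can be linted. The sketch below is an assumption-laden illustration: the `StakeholderContract` class, its field names, and the `problems` check are hypothetical, and the allowed values are copied from the list above.

```python
from dataclasses import dataclass, fields

# Allowed values taken from the field list above.
ARTIFACT_FORMATS = {"csv", "parquet", "dashboard (URL)", "API endpoint",
                    "PDF report", "Slack summary"}
CADENCES = {"one-shot", "daily", "weekly", "on-demand"}


@dataclass(frozen=True)
class StakeholderContract:
    consumer: str
    artifact_format: str
    schema: str
    cadence: str
    freshness_sla: str
    failure_behavior: str
    sunset_criteria: str

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the contract is complete."""
        issues = [f"{f.name} is empty" for f in fields(self)
                  if not getattr(self, f.name).strip()]
        if self.consumer.strip().lower() == "the business":
            issues.append("consumer must be a named human or team")
        if self.artifact_format not in ARTIFACT_FORMATS:
            issues.append(f"unknown artifact_format: {self.artifact_format!r}")
        if self.cadence not in CADENCES:
            issues.append(f"unknown cadence: {self.cadence!r}")
        return issues
```

A recurring pipeline could assert `contract.problems() == []` in its test suite next to the evaluation test.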
141
+ ### Nonfunctional requirements
142
+
143
+ Nonfunctional requirements are what separate a notebook from a deliverable. Three to name explicitly:
144
+
145
+ - **Reproducibility**: the pipeline must produce byte-identical outputs given identical inputs. That means a pinned `requirements.txt` (or `pyproject.toml` + lockfile), explicit `random_state` on every stochastic step (train/test split, model init, shuffling, samplers), a recorded data snapshot (immutable parquet under a dated path, not a mutable SQL query), and an entry-point script that runs end-to-end without manual cells. Test it: delete your local `.venv`, re-clone, run the script, diff the outputs. If they differ, reproducibility is broken. The tradeoff: strict byte-reproducibility is hard on GPU — for deep-learning projects, accept statistical reproducibility (metric within a tolerance) and document the exact hardware/CUDA version.
146
+ - **Runtime budget**: name a wall-clock ceiling for the full pipeline on the hardware you actually have. A useful default for small-team work: "end-to-end run (data pull -> train -> evaluate -> scoring output) must complete in <= 1 hour on a 16GB MacBook Pro." If you blow past it, either simplify or move to a bigger box deliberately — do not let runtime creep silently.
147
+ - **Storage budget**: cap the on-disk footprint of raw data, features, and model artifacts. For laptop-scale work, `< 20 GB` total is a reasonable starting point; over that, you need a deliberate story (external object store, partitioned pulls, sampling). Record the budget in the README and check it in CI with a simple `du -sk` assertion (kilobyte output is easier to compare numerically than the human-readable units from `du -sh`).
148
+
149
+ Encode these as top-of-project invariants, not aspirations. If the model hits the success metric but the pipeline is unreproducible or blows the runtime budget, the project is not done.
150
+
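Two of these invariants lend themselves to a few lines of standard-library Python: the storage budget and the "diff the outputs" reproducibility check. This is a rough sketch, assuming outputs are ordinary files on disk; the function names and the 20 GB constant (taken from the default above) are illustrative.

```python
import hashlib
from pathlib import Path

STORAGE_BUDGET_BYTES = 20 * 1024**3  # the "< 20 GB" laptop-scale default


def dir_size_bytes(root) -> int:
    """Total size of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def within_storage_budget(roots, budget: int = STORAGE_BUDGET_BYTES) -> bool:
    """True if the combined footprint of the given directories fits the budget."""
    return sum(dir_size_bytes(r) for r in roots if Path(r).exists()) <= budget


def file_digest(path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large artifacts do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing `file_digest` of an output from two clean runs is a byte-level reproducibility check; for GPU work, where the text above accepts statistical reproducibility, a metric-tolerance comparison would replace it.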
151
+ Taken together, these five sections — problem framing, success metric, evaluation test, stakeholder contract, and nonfunctional budget — form the acceptance spec for the project. Write them up front, commit them alongside the code, and treat any drift as a scope change that requires re-agreeing with the stakeholder.
@@ -0,0 +1,151 @@
1
+ ---
2
+ name: data-science-security
3
+ description: Practical security guardrails for solo / small-team data-science work — PII masking at ingest, credential hygiene with direnv and 1Password, data classification tiers, notebook output stripping, and a note on model memorization
4
+ topics: [data-science, security, pii, secrets, data-classification]
5
+ ---
6
+
7
+ DS work has elevated security risk because analysis code routinely touches raw customer data before anyone has had a chance to sanitize it. A notebook can render real names, emails, and account numbers inline, then get committed to git, emailed to a stakeholder, or pasted into Slack without a second thought. Prediction caches and CSV exports quietly duplicate sensitive rows into `data/` subdirectories. Credentials for warehouses and cloud buckets get dropped into `.env` files or — worse — directly into a notebook cell. The blast radius of a sloppy DS workflow is larger than people assume, and the mitigations are not exotic: they are cheap, boring habits that need to be enforced by tooling.
8
+
9
+ ## Summary
10
+
11
+ Mask `PII` at the ingest boundary so downstream notebooks and logs never see raw identifiers — hash emails, truncate names, drop free-text you do not need. Never commit `secrets`; keep local credentials in a gitignored `direnv` `.envrc.local` or, better, inject them at runtime with `1Password` CLI (`op run --`) so they are never written to disk. Classify every dataset as public / internal / confidential / restricted and let the tier decide where it lives — restricted data stays in the warehouse, confidential gets gitignored, internal lives on a shared drive, public is public. Strip notebook outputs with `nbstripout` as a pre-commit hook (or switch to Marimo's `.py` notebooks, which do not embed outputs at all). For fine-tuned or RAG models, assume training data can leak back out through generations and scrub accordingly.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Handling PII
16
+
17
+ Identify `PII` at the ingest boundary, not inside your analysis code. The rule is: once a column has left the ingest layer, it should either be pseudonymized (hashed, truncated, bucketed) or stripped. Free-text fields (support tickets, chat logs, notes) are the worst offenders — if the analysis does not require them, drop them. If it does, run them through a scrubber like Presidio or a simple regex pass before they land in a DataFrame.
18
+
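The "simple regex pass" mentioned above might look like the sketch below. It is deliberately crude: the patterns are assumptions that cover common email, US-style phone, and SSN shapes only, and a regex pass will miss names and free-form identifiers that a dedicated scrubber is designed to catch.

```python
import re

# Illustrative patterns; tune them to the identifiers that appear in your data.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_SSN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")
_US_PHONE = re.compile(
    r"(?<!\d)(?:\+?1[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
)


def scrub_text(text: str) -> str:
    """Replace regex-detectable identifiers with fixed placeholder tokens."""
    text = _EMAIL.sub("[EMAIL]", text)
    text = _SSN.sub("[SSN]", text)
    text = _US_PHONE.sub("[PHONE]", text)
    return text
```

Apply it column-wise at ingest, for example `df["notes"] = df["notes"].map(scrub_text)`, where `notes` is a placeholder column name.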
19
+ Typical categories to handle:
20
+
21
+ - **Direct identifiers** — name, email, phone, SSN, account number, precise address. Hash or drop.
22
+ - **Quasi-identifiers** — ZIP + age + gender can re-identify an individual in a surprisingly small population. Bucket aggressively (age → 10-year bands, ZIP → first 3 digits).
23
+ - **Sensitive attributes** — health, financial, biometric. Treat as restricted (see classification below) and keep out of local files entirely.
24
+ - **Free-text** — run through a scrubber or drop unless the analysis genuinely needs the prose.
25
+
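The bucketing rule for quasi-identifiers can be sketched with pandas. The column names `age` and `zip` are assumptions for illustration, and the band width and ZIP prefix length simply mirror the bullet above.

```python
import pandas as pd


def bucket_quasi_identifiers(df: pd.DataFrame, age_col: str = "age",
                             zip_col: str = "zip") -> pd.DataFrame:
    """Coarsen age to 10-year bands and ZIP to its first 3 digits; drop the raw columns."""
    out = df.copy()
    if age_col in out:
        edges = list(range(0, 131, 10))                   # 0, 10, ..., 130
        labels = [f"{lo}-{lo + 9}" for lo in edges[:-1]]  # "0-9", "10-19", ...
        # right=False gives half-open bins [lo, lo + 10); out-of-range ages become NaN.
        out["age_band"] = pd.cut(out[age_col], bins=edges, right=False, labels=labels)
        out = out.drop(columns=[age_col])
    if zip_col in out:
        # zfill restores leading zeros lost when ZIPs were parsed as integers.
        out["zip3"] = out[zip_col].astype(str).str.zfill(5).str[:3]
        out = out.drop(columns=[zip_col])
    return out
```

Bucketing reduces re-identification risk but does not eliminate it; small cells (a band/ZIP combination with only a handful of rows) may still need suppression.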
26
+ A minimal masking helper for structured data:
27
+
28
+ ```python
+ # src/pii.py
+ import hashlib
+ import pandas as pd
+
+ def _hash_email(email: str, salt: str) -> str:
+     """Deterministic, salted hash — same email maps to same token for joins."""
+     if pd.isna(email):
+         return ""
+     return hashlib.sha256(f"{salt}:{email.lower().strip()}".encode()).hexdigest()[:16]
+
+ def mask_customer_frame(df: pd.DataFrame, salt: str) -> pd.DataFrame:
+     out = df.copy()
+     if "email" in out:
+         out["email_id"] = out["email"].map(lambda e: _hash_email(e, salt))
+         out = out.drop(columns=["email"])
+     if "full_name" in out:
+         # keep first initial for rough demographic analysis, drop the rest
+         out["name_initial"] = out["full_name"].str[:1]
+         out = out.drop(columns=["full_name"])
+     # drop anything we never need
+     for col in ("phone", "ssn", "address", "dob"):
+         if col in out:
+             out = out.drop(columns=[col])
+     return out
+ ```
54
+
55
+ Pair this with a `pandera` schema check on the training-ready DataFrame that asserts sensitive columns are absent — "no bare `email` column, no `ssn` column, no `phone` column." That way a future change that accidentally reintroduces raw PII fails loudly in CI instead of silently:
56
+
57
+ ```python
+ import pandas as pd
+ import pandera.pandas as pa
+
+ TrainingSchema = pa.DataFrameSchema(
+     columns={
+         "email_id": pa.Column(str),
+         "name_initial": pa.Column(str, nullable=True),
+         "signup_month": pa.Column("datetime64[ns]"),
+     },
+     strict=True,  # reject any column not listed
+ )
+
+ # extra defensive: blacklist raw-PII names in case strict=True is relaxed later
+ _FORBIDDEN = {"email", "full_name", "phone", "ssn", "address", "dob"}
+
+ def assert_no_raw_pii(df: pd.DataFrame) -> None:
+     """Fail loudly if any raw-PII column is present in the training-ready frame."""
+     leaked = _FORBIDDEN & set(df.columns)
+     assert not leaked, f"raw PII leaked: {leaked}"
+ ```
73
+
74
+ Run this check at the boundary between ingest and modeling, and again before anything gets written to a prediction cache or exported as a report.
75
+
76
+ ### Credential hygiene
77
+
78
+ Never commit `secrets`. There are two patterns worth using locally; pick one per project and be consistent.
79
+
80
+ **Pattern 1 — `direnv` with a gitignored `.envrc.local`:**
81
+
82
+ ```bash
+ # .envrc (committed — references local overrides)
+ dotenv_if_exists .envrc.local
+
+ # .envrc.local (gitignored — real values live here)
+ export WAREHOUSE_URL="postgres://analytics:REAL_PASSWORD@warehouse.internal/prod"
+ export AWS_PROFILE="ds-read"
+ ```
90
+
91
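+ Add `.envrc.local` and `.env*` to `.gitignore`. Because the `.env*` glob also matches the committed `.envrc` (and `.env.1password`, if you use Pattern 2), follow it with `!.envrc` and `!.env.1password` negations. `direnv` loads these exports automatically when you `cd` into the project, once you have approved the `.envrc` with `direnv allow`.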
92
+
93
+ **Pattern 2 — `1Password` CLI with `op run`:**
94
+
95
+ ```bash
+ # .env.1password (committed — references, not values)
+ WAREHOUSE_URL=op://DS/warehouse-prod/connection_url
+ OPENAI_API_KEY=op://DS/openai/api_key
+
+ # run any command with secrets injected at runtime
+ op run --env-file=.env.1password -- python src/train.py
+ op run --env-file=.env.1password -- jupyter lab
+ ```
104
+
105
+ `op run` substitutes the `op://` references with real values in the child process's environment and never writes them to disk. The committed `.env.1password` file is safe to share because it contains only vault paths, not secrets. This is the stronger pattern when more than one person needs access — you manage grants in 1Password instead of passing `.envrc.local` files around.
106
+
107
+ In production, secrets live in the platform's secret manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) and get injected into the runtime the same way. The governing rule: **if it would go in a `.env` file, it goes in 1Password; if it would go in a secret manager in prod, it stays there — don't duplicate a copy onto your laptop.**
108
+
109
+ A few hygiene rules that follow from this:
110
+
111
+ - Never paste an API key into a notebook cell, even temporarily. Cells get autosaved, checkpointed, and sometimes committed.
112
+ - Never print a credential to logs — wrap secret-carrying objects in types that redact on `__repr__` (Pydantic's `SecretStr`, for example).
113
+ - Rotate any credential that has ever touched your clipboard, a chat window, or a screen share.
114
+ - Run a pre-commit scanner (`gitleaks` or `detect-secrets`) so a stray key cannot get committed even when the `.envrc.local` pattern is ignored.
115
+
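The redact-on-`__repr__` idea can be shown without any dependency. The sketch below is a standard-library stand-in for the same behavior, not Pydantic's `SecretStr`; the `Redacted` class and `secret_from_env` helper are hypothetical names.

```python
import os


class Redacted:
    """Holds a secret string; repr(), str(), and f-strings never reveal it."""

    __slots__ = ("_value",)

    def __init__(self, value: str):
        self._value = value

    def reveal(self) -> str:
        """Explicit, greppable access to the real value."""
        return self._value

    def __repr__(self) -> str:
        return "Redacted('**********')"

    __str__ = __repr__


def secret_from_env(name: str) -> Redacted:
    """Read a secret injected by direnv or `op run`; fail loudly if it is missing."""
    try:
        return Redacted(os.environ[name])
    except KeyError:
        raise RuntimeError(f"{name} is not set in the environment") from None
```

Because the raw value is only reachable through `.reveal()`, an accidental `print(secret)` or a logged traceback shows the placeholder instead of the credential.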
116
+ ### Data classification
117
+
118
+ Classify every dataset against a four-tier rubric and let the tier drive storage and access:
119
+
120
+ - **Public** — already on the internet (open datasets, published benchmarks). Can live anywhere, including git.
121
+ - **Internal** — non-sensitive company data (aggregated metrics, anonymized cohorts). Shared private drive or object store with team-level access. Do not commit to git.
122
+ - **Confidential** — business-sensitive but not regulated (revenue breakdowns, customer segments, unreleased product data). Gitignored `data/` directory locally; encrypted bucket with narrow ACL for sharing. Never in notebooks you paste into Slack.
123
+ - **Restricted** — regulated or high-risk PII (health records, payment data, government IDs, raw customer identifiers). Stays in the warehouse or source bucket — **do not download**. Run analysis server-side (dbt model, warehouse notebook, SQL-only pipeline) and only materialize aggregates locally.
124
+
125
+ The mapping matters more than the labels. The point of classification is that "can I keep a CSV of this on my laptop?" has a predetermined answer instead of a per-dataset judgment call made while tired.
126
+
127
+ Record the classification alongside the data — a one-line `data/README.md` entry per source (`customers_raw: restricted, warehouse-only`) is enough. When a new teammate or a future-you adds a pull, the constraint is visible without having to ask.
128
+
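If the `data/README.md` entries follow the `name: tier, note` shape shown above, the predetermined answer can be made machine-checkable. This is a sketch under that assumption; the `Tier` enum and helper names are illustrative, and allowing local copies of internal data is an interpretation of the rubric rather than something it states outright.

```python
from enum import Enum


class Tier(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


def parse_entry(line: str) -> tuple[str, Tier]:
    """Parse an entry such as 'customers_raw: restricted, warehouse-only'."""
    name, _, rest = line.partition(":")
    tier = Tier(rest.split(",")[0].strip().lower())  # raises ValueError on unknown tiers
    return name.strip(), tier


def may_commit_to_git(tier: Tier) -> bool:
    """Only public data may live in git."""
    return tier is Tier.PUBLIC


def may_download(tier: Tier) -> bool:
    """Restricted data stays in the warehouse; everything else may be held locally.

    Assumption: internal data may be copied locally (the rubric only names
    shared storage for it).
    """
    return tier is not Tier.RESTRICTED
```

A pull script could call `may_download` before writing anything under `data/`, so the constraint is enforced rather than remembered.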
129
+ ### Notebook output hygiene
130
+
131
+ A Jupyter `.ipynb` file is a JSON blob that embeds every cell's rendered output, which means a single `df.head()` on a customer table commits 5 real customer rows to git forever. Strip outputs with `nbstripout` as a pre-commit hook:
132
+
133
+ ```yaml
+ # .pre-commit-config.yaml
+ repos:
+   - repo: https://github.com/kynan/nbstripout
+     rev: 0.7.1
+     hooks:
+       - id: nbstripout
+         files: \.ipynb$
+ ```
142
+
143
+ Install once with `pre-commit install` and every `git commit` scrubs outputs automatically. For belt-and-braces, pair it with a pre-save hook in your Jupyter config (`jupyter_notebook_config.py`) that clears cell outputs before the notebook is written to disk.
144
+
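As a CI-side backstop for the hook, a notebook can be inspected directly, since it is plain JSON. The following is a minimal sketch assuming the nbformat v4 layout (a top-level `cells` list whose code cells carry an `outputs` list); `cells_with_outputs` is a hypothetical helper, not part of `nbstripout`.

```python
import json


def cells_with_outputs(notebook_json: str) -> list[int]:
    """Return indexes of code cells that still embed rendered output."""
    nb = json.loads(notebook_json)
    return [
        i
        for i, cell in enumerate(nb.get("cells", []))
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]
```

A test that reads every tracked `.ipynb` and asserts the returned list is empty would catch notebooks committed from a clone where the pre-commit hook was never installed.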
145
+ Marimo's `.py`-format notebooks sidestep this problem — they are regular Python files, outputs never get persisted in the notebook, and diffs are reviewable like ordinary code. If you have not picked a notebook format for a new project, prefer Marimo; see `data-science-notebook-discipline` for the broader tradeoffs.
146
+
147
+ Whichever format you pick, also keep prediction caches, CSV exports, and ad-hoc scratch files out of git — a broad `data/` and `outputs/` entry in `.gitignore` prevents the most common leak: a confidential sample dataset getting committed as an "example."
148
+
149
+ ### A word on model memorization
150
+
151
+ Fine-tuned LLMs and RAG systems can reproduce training data verbatim under the right prompt. If your fine-tune corpus or retrieval index contains PII, assume it can leak. Mitigations, in order of strength: scrub PII from the corpus before training or indexing (reuse the masking helper above); host the model privately so prompts and responses stay inside your perimeter; apply output filtering to block regex-detectable identifiers on the way out. Do not fine-tune a public base model on raw customer data and then expose it on an open endpoint — that is the failure mode worth avoiding.