npm - @zigrivers/scaffold - Versions diffs - 3.21.0 → 3.23.0 - Mend

@zigrivers/scaffold 3.21.0 → 3.23.0

Files changed (124)

package/README.md +21 -7
package/content/knowledge/data-science/README.md +23 -0
package/content/knowledge/data-science/data-science-architecture.md +163 -0
package/content/knowledge/data-science/data-science-conventions.md +233 -0
package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
package/content/knowledge/data-science/data-science-observability.md +161 -0
package/content/knowledge/data-science/data-science-project-structure.md +178 -0
package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
package/content/knowledge/data-science/data-science-requirements.md +151 -0
package/content/knowledge/data-science/data-science-security.md +151 -0
package/content/knowledge/data-science/data-science-testing.md +183 -0
package/content/knowledge/ml/README.md +10 -0
package/content/methodology/data-science-overlay.yml +39 -0
package/dist/cli/commands/dashboard.d.ts.map +1 -1
package/dist/cli/commands/dashboard.js +40 -0
package/dist/cli/commands/dashboard.js.map +1 -1
package/dist/config/schema.d.ts +672 -126
package/dist/config/schema.d.ts.map +1 -1
package/dist/config/schema.js +8 -0
package/dist/config/schema.js.map +1 -1
package/dist/config/schema.test.js +2 -2
package/dist/config/schema.test.js.map +1 -1
package/dist/config/validators/data-science.d.ts +4 -0
package/dist/config/validators/data-science.d.ts.map +1 -0
package/dist/config/validators/data-science.js +15 -0
package/dist/config/validators/data-science.js.map +1 -0
package/dist/config/validators/index.d.ts.map +1 -1
package/dist/config/validators/index.js +2 -0
package/dist/config/validators/index.js.map +1 -1
package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
package/dist/core/assembly/knowledge-loader.js +6 -0
package/dist/core/assembly/knowledge-loader.js.map +1 -1
package/dist/core/assembly/knowledge-loader.test.js +34 -0
package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
package/dist/dashboard/dependency-graph.d.ts +19 -0
package/dist/dashboard/dependency-graph.d.ts.map +1 -0
package/dist/dashboard/dependency-graph.js +180 -0
package/dist/dashboard/dependency-graph.js.map +1 -0
package/dist/dashboard/dependency-graph.test.d.ts +2 -0
package/dist/dashboard/dependency-graph.test.d.ts.map +1 -0
package/dist/dashboard/dependency-graph.test.js +409 -0
package/dist/dashboard/dependency-graph.test.js.map +1 -0
package/dist/dashboard/generator.d.ts +46 -0
package/dist/dashboard/generator.d.ts.map +1 -1
package/dist/dashboard/generator.js +1 -0
package/dist/dashboard/generator.js.map +1 -1
package/dist/dashboard/multi-service.test.js +257 -1
package/dist/dashboard/multi-service.test.js.map +1 -1
package/dist/dashboard/template.d.ts +13 -0
package/dist/dashboard/template.d.ts.map +1 -1
package/dist/dashboard/template.js +176 -0
package/dist/dashboard/template.js.map +1 -1
package/dist/e2e/dashboard-cross-service-graph-wiring.test.d.ts +2 -0
package/dist/e2e/dashboard-cross-service-graph-wiring.test.d.ts.map +1 -0
package/dist/e2e/dashboard-cross-service-graph-wiring.test.js +130 -0
package/dist/e2e/dashboard-cross-service-graph-wiring.test.js.map +1 -0
package/dist/e2e/dashboard-cross-service-graph.test.d.ts +2 -0
package/dist/e2e/dashboard-cross-service-graph.test.d.ts.map +1 -0
package/dist/e2e/dashboard-cross-service-graph.test.js +216 -0
package/dist/e2e/dashboard-cross-service-graph.test.js.map +1 -0
package/dist/e2e/project-type-overlays.test.js +73 -0
package/dist/e2e/project-type-overlays.test.js.map +1 -1
package/dist/project/adopt.d.ts.map +1 -1
package/dist/project/adopt.js +3 -1
package/dist/project/adopt.js.map +1 -1
package/dist/project/detectors/coverage.test.d.ts +2 -0
package/dist/project/detectors/coverage.test.d.ts.map +1 -0
package/dist/project/detectors/coverage.test.js +78 -0
package/dist/project/detectors/coverage.test.js.map +1 -0
package/dist/project/detectors/data-science.d.ts +4 -0
package/dist/project/detectors/data-science.d.ts.map +1 -0
package/dist/project/detectors/data-science.js +32 -0
package/dist/project/detectors/data-science.js.map +1 -0
package/dist/project/detectors/data-science.test.d.ts +2 -0
package/dist/project/detectors/data-science.test.d.ts.map +1 -0
package/dist/project/detectors/data-science.test.js +62 -0
package/dist/project/detectors/data-science.test.js.map +1 -0
package/dist/project/detectors/disambiguate.d.ts +2 -0
package/dist/project/detectors/disambiguate.d.ts.map +1 -1
package/dist/project/detectors/disambiguate.js +3 -2
package/dist/project/detectors/disambiguate.js.map +1 -1
package/dist/project/detectors/disambiguate.test.js +10 -1
package/dist/project/detectors/disambiguate.test.js.map +1 -1
package/dist/project/detectors/index.d.ts.map +1 -1
package/dist/project/detectors/index.js +2 -0
package/dist/project/detectors/index.js.map +1 -1
package/dist/project/detectors/library.d.ts.map +1 -1
package/dist/project/detectors/library.js +1 -0
package/dist/project/detectors/library.js.map +1 -1
package/dist/project/detectors/resolve-detection.test.js +31 -0
package/dist/project/detectors/resolve-detection.test.js.map +1 -1
package/dist/project/detectors/types.d.ts +6 -2
package/dist/project/detectors/types.d.ts.map +1 -1
package/dist/project/detectors/types.js.map +1 -1
package/dist/types/config.d.ts +8 -1
package/dist/types/config.d.ts.map +1 -1
package/dist/wizard/copy/core.d.ts.map +1 -1
package/dist/wizard/copy/core.js +4 -0
package/dist/wizard/copy/core.js.map +1 -1
package/dist/wizard/copy/data-science.d.ts +3 -0
package/dist/wizard/copy/data-science.d.ts.map +1 -0
package/dist/wizard/copy/data-science.js +15 -0
package/dist/wizard/copy/data-science.js.map +1 -0
package/dist/wizard/copy/index.d.ts.map +1 -1
package/dist/wizard/copy/index.js +2 -0
package/dist/wizard/copy/index.js.map +1 -1
package/dist/wizard/copy/types.d.ts +5 -1
package/dist/wizard/copy/types.d.ts.map +1 -1
package/dist/wizard/copy/types.test-d.js +7 -0
package/dist/wizard/copy/types.test-d.js.map +1 -1
package/dist/wizard/questions.d.ts +2 -1
package/dist/wizard/questions.d.ts.map +1 -1
package/dist/wizard/questions.js +9 -1
package/dist/wizard/questions.js.map +1 -1
package/dist/wizard/questions.test.js +14 -0
package/dist/wizard/questions.test.js.map +1 -1
package/dist/wizard/wizard.d.ts.map +1 -1
package/dist/wizard/wizard.js +1 -0
package/dist/wizard/wizard.js.map +1 -1
package/package.json +1 -1

package/content/knowledge/data-science/data-science-project-structure.md ADDED

@@ -0,0 +1,178 @@
+---
+name: data-science-project-structure
+description: Opinionated directory layout for solo and small-team data-science projects — notebooks, src, data, models, reports, tests, configs — with a promotion path from exploration to tested modules
+topics: [data-science, project-structure, layout]
+---
+A solo data-science project accumulates artifacts faster than most software: half-finished notebooks, CSV dumps, parquet caches, serialized models, PNG charts, and the occasional markdown write-up. Without a deliberate directory structure, the project turns into a folder of 40 loose files within a month and a new contributor — including future-you — cannot tell what is canonical, what is scratch, and what is safe to delete. A clear layout fixes three problems at once: discoverability (where does X live?), git hygiene (what is tracked vs generated?), and the promotion path (how does throwaway notebook code become tested library code?).
+## Summary
+A solo DS project has six top-level directories that each answer one question: `notebooks/` (exploration), `src/` (importable Python modules), `data/` (split into raw/interim/processed — `data/raw/` is always gitignored; small processed artifacts may be committed or DVC-tracked), `models/` (serialized artifacts, tracked via DVC or git-lfs), `reports/` (rendered outputs — figures, HTML, markdown), and `tests/` (pytest suite mirroring `src/`). `configs/` holds YAML run parameters, and `pyproject.toml` at the root defines the package. The `.gitignore` excludes raw data, most of `models/`, and common binary formats that were not deliberately promoted. Reusable logic follows a strict promotion path: explored in a notebook, extracted into `src/`, unit-tested in `tests/`, then re-imported by notebooks or pipeline scripts.
+## Deep Guidance
+### Top-level layout
+```
+project-root/
+├── notebooks/          # Exploratory notebooks (Marimo preferred; numbered chronologically)
+├── src/                # Importable Python modules — the library
+│   └── <project>/
+│       ├── __init__.py
+│       ├── ingestion.py    # Load raw data from source (CSV, DB, API)
+│       ├── features.py     # Feature engineering / transforms
+│       ├── training.py     # Model fitting routines
+│       ├── evaluation.py   # Metrics, CV loops, slice analysis
+│       └── serving.py      # Inference helpers (load artifact, predict)
+├── data/               # Datasets at every pipeline stage
+│   ├── raw/            # Immutable inputs — GITIGNORED (always)
+│   ├── interim/        # Cached intermediates — small Parquet may be committed
+│   └── processed/      # Analysis-ready — usually DVC-tracked; small files may be committed
+├── models/             # Serialized model artifacts (DVC / git-lfs tracked)
+├── reports/            # Rendered output: figures/, HTML reports, markdown summaries
+│   └── figures/
+├── tests/              # pytest suite — mirrors src/ structure
+├── configs/            # YAML run configs (Hydra-style or plain)
+├── pyproject.toml      # Package metadata, dependencies, tool config
+├── .gitignore
+└── README.md
+```
+One-liners per dir:
+- `notebooks/` — exploration, EDA, prototyping; numbered `01-…`, `02-…` so ordering is obvious
+- `src/` — every reusable function that a second notebook or a pipeline script will call
+- `data/` — all datasets at every stage; raw is always gitignored, selected processed artifacts (small Parquet in `data/interim/` or `data/processed/`) may be committed directly or tracked via DVC — see `data-science-data-versioning`
+- `models/` — trained model artifacts; tracked through DVC or git-lfs pointers, never raw binaries
+- `reports/` — things a human reads: charts, HTML reports, markdown summaries
+- `tests/` — pytest tests for code in `src/`
+- `configs/` — experiment parameters (paths, seeds, hyperparams) separate from code
+### Data: gitignore raw, deliberately admit small processed artifacts
+The single hardest rule in DS project hygiene: **never commit raw datasets under `data/raw/` or raw model binaries under `models/` to git**. A 200 MB Parquet file committed to history stays in every clone until the history is rewritten: a tool such as `git filter-repo` is the only cure, and it rewrites every commit from the offending one onward. Prevent the problem at the `.gitignore` layer before it happens.
+Gitignoring the entire `data/` tree is the safest default, but it under-serves a common small-team workflow: a cleaned, analysis-ready Parquet in `data/interim/` that's <10 MB, changes rarely, and is useful to have alongside the code. See `data-science-data-versioning` for the full size-based decision rule. The pattern below gitignores raw data and external copies wholesale, and allows opt-in commits of small processed Parquet through a deliberate un-ignore rule. Anything larger (>50 MB, frequent churn, binary artifacts) goes through DVC or git-lfs instead — never direct git commits.
+```gitignore
+# Raw / external data — never committed (bulky, usually not redistributable)
+data/raw/
+data/external/
+# Processed / interim data — default: ignore; opt in to specific small artifacts below
+data/interim/*
+data/processed/*
+!data/.gitkeep
+!data/interim/.gitkeep
+!data/processed/.gitkeep
+# Allow small cleaned Parquet to be committed (see data-science-data-versioning
+# for size guidance — under ~10 MB, rare changes). Larger artifacts belong in
+# DVC or git-lfs.
+!data/interim/*.parquet
+!data/processed/*.parquet
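+# Model artifacts: tracked via DVC or git-lfs, not raw binaries.
+# Ignore the directory's contents (models/*), not the directory itself (models/):
+# git cannot re-include a file whose parent directory is excluded, so the
+# negations below would be dead rules under a plain `models/` pattern.
+# Files in nested subdirectories of models/ stay ignored with this pattern.
+models/*
+!models/.gitkeep
+!models/*.dvc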
+# Common large binary formats (defense in depth — catch anything dropped elsewhere)
+*.feather
+*.joblib
+*.pt
+*.pth
+*.onnx
+*.h5
+*.hdf5
+*.npy
+*.npz
+# Python
+__pycache__/
+*.pyc
+.venv/
+.ruff_cache/
+.pytest_cache/
+*.egg-info/
+# Notebook outputs (if not using a tool that strips them)
+.ipynb_checkpoints/
+# Environment / secrets
+.env
+.env.*
+!.env.example
+```
+Two things are load-bearing in this snippet. First, `*.parquet` is **not** in the blanket block-list — we want `data/interim/*.parquet` to match as "allowed" once the un-ignore rules kick in. Second, the `!data/interim/*.parquet` and `!data/processed/*.parquet` patterns mean processed Parquet is committable **by default** at this layer; the policy choice of whether to actually commit a given file is made at `git add` time, not in `.gitignore`. If your team's policy is DVC-first for every dataset, drop those `!…*.parquet` lines. The `!data/.gitkeep` family keeps the directories present in fresh clones.
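+For versioned datasets and models, see `data-science-data-versioning`: DVC or git-lfs pointers are committed, and the binaries themselves live in remote storage. Treat model artifacts from outside the team as code. Stdlib pickle executes arbitrary code on load, and so does anything built on it: `joblib.load` is pickle-based, and `torch.load` on a `.pt` file unpickles unless it is restricted to weights only. A model file from an untrusted source is therefore an RCE vector. For artifacts that cross a trust boundary, prefer formats that store only tensors or a computation graph (`.onnx`, safetensors), and keep `joblib` for artifacts you produced yourself.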
+### Notebooks → src/ promotion
+Notebooks are for exploration, not production. The moment a function in a notebook becomes useful to a second notebook — or looks like it will survive longer than the current sitting — it gets promoted:
+1. **Identify**: a cell (or few cells) encapsulating reusable logic — a loader, a transform, a metric computation
+2. **Extract**: move the function into the appropriate `src/<project>/` module (`ingestion.py`, `features.py`, etc.) with type hints and a docstring
+3. **Test**: add a pytest case in `tests/` that exercises a representative input → output case
+4. **Re-import**: the notebook now does `from <project>.features import clean_customer_ids` instead of defining the function inline
+This discipline keeps notebooks short (exploration, narrative, charts) and concentrates correctness-critical code where it can be reviewed, tested, and reused. See `data-science-notebook-discipline` for the mechanics of cell size, output clearing, and `%autoreload` so edits in `src/` are picked up in the notebook without a kernel restart.
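The four promotion steps can be sketched with the `clean_customer_ids` example named above. The function body is an assumption made for illustration (the source names the function and the whitespace behavior, nothing more). Both pieces are shown in one block, although in a real project they would live in `src/<project>/features.py` and `tests/test_features.py`.

```python
from collections.abc import Iterable


def clean_customer_ids(ids: Iterable[str]) -> list[str]:
    """Strip surrounding whitespace from customer IDs and drop empty entries.

    Hypothetical behavior for illustration; adapt it to your own ID format.
    """
    cleaned = (raw.strip() for raw in ids)
    return [cid for cid in cleaned if cid]


def test_clean_customer_ids_strips_whitespace() -> None:
    # One representative input -> output case, as in step 3.
    assert clean_customer_ids(["  C001 ", "C002\n", "   "]) == ["C001", "C002"]
```

After the move, the notebook cell shrinks to an import plus a call, and the behavior is pinned by a test that runs without opening the notebook.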
+### Configs and reproducibility
+Hard-coded paths and hyperparameters inside notebook cells are the single biggest reproducibility killer in a DS project. Push them into `configs/` so a run is defined by a config file + a git SHA.
+```yaml
+# configs/train_baseline.yaml
+run_name: baseline_v1
+seed: 42
+data:
+  raw_path: data/raw/transactions_2024.csv
+  processed_path: data/processed/transactions_clean.parquet
+  target: churned_30d
+  test_size: 0.2
+  split_seed: 42
+features:
+  include:
+    - tenure_days
+    - monthly_spend
+    - support_tickets_30d
+  log_transform:
+    - monthly_spend
+model:
+  type: gradient_boosting
+  params:
+    n_estimators: 200
+    max_depth: 5
+    learning_rate: 0.05
+output:
+  model_path: models/baseline_v1.joblib
+  report_path: reports/baseline_v1.html
+```
+Training code reads the config with `yaml.safe_load` (or Hydra / pydantic-settings for richer projects) and a teammate can reproduce the run with `python -m <project>.training --config configs/train_baseline.yaml`. For Hydra specifically, configs split into `configs/data/`, `configs/model/`, `configs/training/` and compose at the command line.
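A minimal sketch of the loading side, assuming the YAML layout shown above. `TrainConfig`, `from_dict`, and `load_config` are hypothetical names, not part of the package. Only the fields the sketch needs are typed, and PyYAML is imported lazily so the dataclass can be exercised without it.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TrainConfig:
    """Typed view of the fields a training run needs (hypothetical)."""

    run_name: str
    seed: int
    raw_path: str
    target: str
    test_size: float
    model_type: str
    model_params: dict[str, Any]
    model_path: str

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "TrainConfig":
        # Indexing with [] raises KeyError on a missing key, which is
        # preferable to silently training with a default nobody chose.
        cfg = cls(
            run_name=raw["run_name"],
            seed=int(raw["seed"]),
            raw_path=raw["data"]["raw_path"],
            target=raw["data"]["target"],
            test_size=float(raw["data"]["test_size"]),
            model_type=raw["model"]["type"],
            model_params=dict(raw["model"].get("params", {})),
            model_path=raw["output"]["model_path"],
        )
        if not 0.0 < cfg.test_size < 1.0:
            raise ValueError(f"test_size must be in (0, 1), got {cfg.test_size}")
        return cfg


def load_config(path: str) -> TrainConfig:
    """Parse a YAML run config from disk into a TrainConfig."""
    import yaml  # PyYAML, third-party; imported here so the class has no hard dependency

    with open(path, encoding="utf-8") as fh:
        return TrainConfig.from_dict(yaml.safe_load(fh))
```

Validating at load time means a typo in the config fails before any data is read, not halfway through a training run.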
+### Tests layout
+`tests/` mirrors `src/` one-to-one. If `src/<project>/features.py` defines `clean_customer_ids`, then `tests/test_features.py` contains `test_clean_customer_ids_strips_whitespace` and friends.
+```
+tests/
+├── conftest.py             # Shared fixtures (tiny sample dataframes, tmp_path helpers)
+├── test_ingestion.py       # Tests for src/<project>/ingestion.py
+├── test_features.py        # Tests for src/<project>/features.py
+├── test_training.py        # Tests for src/<project>/training.py — usually smoke tests
+└── test_evaluation.py      # Tests for src/<project>/evaluation.py
+```
+Naming rules:
+- Test files: `test_<module>.py` — pytest discovers these by default
+- Test functions: `test_<unit>_<behavior>` — e.g. `test_clean_customer_ids_strips_whitespace`, `test_load_transactions_raises_on_missing_file`
+- Fixtures live in `conftest.py` at the `tests/` root when shared across files; local fixtures stay in the file that uses them
+Training and evaluation tests are typically **smoke tests** over a 10-row fixture dataframe, not full-dataset runs — the goal is catching shape/dtype/column regressions, not validating model quality (model quality belongs in the evaluation report, not the unit test suite).

package/content/knowledge/data-science/data-science-reproducibility.md ADDED

@@ -0,0 +1,164 @@
+---
+name: data-science-reproducibility
+description: Reproducibility for solo/small-team DS — pin deps with uv lock, seed everything, set PYTHONHASHSEED, and reach for Docker only at OS boundaries
+topics: [data-science, reproducibility, determinism, uv, docker]
+---
+You show a result in Monday's meeting. Six months later, on a new laptop, you can't reproduce it. Three things usually cause this: dependencies drifted (a minor NumPy release changed a default), randomness wasn't pinned (a shuffle or init picked a different seed), or the data changed underneath you. Reproducibility is the discipline of eliminating all three so the same inputs always produce the same numbers.
+## Summary
+Pin dependencies with `uv lock` and commit `uv.lock` — `uv sync --frozen` rebuilds the exact environment anywhere. Control randomness with a single `set_seed(seed)` helper that seeds Python `random`, NumPy, PyTorch, and TensorFlow at the top of every script. Export `PYTHONHASHSEED=0` via `.envrc` so hash-order is deterministic across interpreter runs. Log the git SHA and data hash with every run so you can walk back to the exact code + data that produced any number. Reach for Docker only when you're crossing an OS or CUDA boundary — for greenfield solo work, `uv sync` is enough.
+## Deep Guidance
+### Pinning dependencies with uv
+`uv` resolves the full transitive dependency graph into `uv.lock`, which records the exact version and content hash of every package, including transitive deps you never directly imported. Commit it. On a new machine, `uv sync --frozen` installs exactly those pinned versions without re-resolving anything; on the same OS and architecture that means the same package files.
+```bash
+# First time: declare top-level deps in pyproject.toml, then lock
+uv lock
+# On any machine (CI, teammate's laptop, 6 months later):
+uv sync --frozen       # install exactly what's in uv.lock, never re-resolve
+# Upgrade a single package intentionally:
+uv lock --upgrade-package numpy
+# Review the lock diff in PR. Re-run your eval suite before merging.
+# Add a new dependency:
+uv add pandas          # updates pyproject.toml AND uv.lock atomically
+```
+Rules:
+- Commit `uv.lock`. It is not a build artifact; it is a reproducibility contract.
+- Use `--frozen` in CI and release scripts. A silent re-resolve on deploy is the bug you're trying to prevent.
+- Upgrade packages one at a time, with a PR and an eval run. Bulk upgrades hide which bump broke your metrics.
+- Pin the Python version too: add `requires-python = "==3.12.*"` in `pyproject.toml` and let uv install and manage the interpreter. Minor Python versions change stdlib behavior and which compiled wheels get resolved, in ways that can move your numbers.
+### Seed management
+Every source of randomness in your stack has its own PRNG. Seed all of them from a single call, at the top of every train/eval/predict entry point.
+```python
+# src/utils/seed.py
+import os
+import random
+import numpy as np
+def set_seed(seed: int = 42) -> None:
+    """Seed every PRNG we might touch. Call at the top of every script."""
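+    # Inherited by child processes only: the running interpreter fixed its
+    # hash seed at startup, so also export PYTHONHASHSEED in the shell.
+    os.environ["PYTHONHASHSEED"] = str(seed)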
+    random.seed(seed)
+    np.random.seed(seed)
+    try:
+        import torch
+        torch.manual_seed(seed)
+        if torch.cuda.is_available():
+            torch.cuda.manual_seed_all(seed)
+    except ImportError:
+        pass
+    try:
+        import tensorflow as tf
+        tf.random.set_seed(seed)
+    except ImportError:
+        pass
+```
+Call `set_seed(42)` before any data split, model init, or sampling. If a library accepts a `random_state` argument (scikit-learn does almost everywhere), pass the seed explicitly — global seeding is a safety net, not a substitute.
+```python
+# Explicit is better than implicit:
+from sklearn.model_selection import train_test_split
+from sklearn.ensemble import RandomForestClassifier
+set_seed(42)  # global safety net
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.2, random_state=42      # explicit
+)
+model = RandomForestClassifier(random_state=42)  # explicit
+```
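+The one gotcha: multi-worker DataLoaders in PyTorch run loading code in worker subprocesses, each with its own RNG state. PyTorch derives each worker's torch seed from the main process, but NumPy and other libraries inside a worker are not reseeded for you. Pass a `worker_init_fn` that seeds them from `torch.initial_seed()`, plus a seeded `generator` on the `DataLoader`, or NumPy-based augmentations can repeat across workers or differ across runs even with `set_seed` called in the main process.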
+### Hash determinism
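+Python randomizes the hash seed for `str` and `bytes` per interpreter run by default. That means set iteration order, and anything else that depends on `hash()` of a string, varies between runs. Dict iteration order is insertion order (guaranteed since Python 3.7) and is not affected, but code that builds a vocabulary or feature index by iterating a set of strings is. It is a subtle reproducibility leak that only shows up when you try to diff two training runs.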
+```bash
+# .envrc (direnv)
+export PYTHONHASHSEED=0
+```
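+`set_seed()` writes the variable too, but that only reaches subprocesses started afterwards, because the interpreter reads `PYTHONHASHSEED` once at startup. Exporting it in `.envrc` covers everything in the shell session (notebooks, ad-hoc scripts, the test runner) before any Python code runs.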
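To see why the variable has to be set before the interpreter starts, launch child interpreters with different settings and compare `hash()` of the same string. This is a sketch using only the standard library; `hash_in_child` is a hypothetical helper name.

```python
import os
import subprocess
import sys


def hash_in_child(hash_seed: str, text: str = "customer_id") -> int:
    """Return hash(text) from a fresh interpreter started with the given PYTHONHASHSEED."""
    env = {**os.environ, "PYTHONHASHSEED": hash_seed}
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Fixed seed: two separate interpreter runs agree.
    print(hash_in_child("0") == hash_in_child("0"))
    # "random" is the default behavior: two runs will almost always disagree.
    print(hash_in_child("random") == hash_in_child("random"))
```

The parent process here only changes the environment handed to its children, which is the same mechanism `.envrc` relies on.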
+### GPU determinism (brief)
+Full GPU determinism requires cuDNN-level flags and disabling non-deterministic kernels:
+```python
+# Only if you actually need this:
+torch.backends.cudnn.deterministic = True
+torch.backends.cudnn.benchmark = False
+```
+This has a real performance cost (often 10-30% slower training) and doesn't cover every op. For DS-1, don't chase it. CPU-level determinism from `set_seed()` + pinned deps is enough for 95% of analyses. Reach for GPU determinism only under regulatory requirement, scientific publication, or when debugging a numerics bug that you can't otherwise isolate.
+### Git SHA and data versioning
+A reproducible run needs four things pinned: code, dependencies, randomness, and data. Dependencies and randomness are covered above. For code, log the git SHA with every experiment (see `data-science-experiment-tracking.md` for the logging pattern; don't duplicate the plumbing here). For data, hash the input dataset or pin a DVC / lakeFS / Git-LFS reference (see `data-science-data-versioning.md`).
+The minimum metadata for any reported result:
+```text
+git_sha:     a1b2c3d4
+uv_lock:     sha256:...          # hash of uv.lock
+seed:        42
+data_hash:   sha256:...          # hash of the input dataset(s)
+python:      3.12.1
+platform:    darwin-arm64
+```
+If all six match, the numbers should match. If any differ, you know exactly which knob moved.
+A working pattern: log these fields into your experiment tracker alongside metrics, and include them in any reported result (paper, slide, dashboard tile). The friction cost is near zero once automated; the debugging cost of a result you can't trace back to its exact code + data is enormous.
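As a sketch of how cheap the automation is, the fields above can be collected with the standard library alone. `collect_run_metadata`, `sha256_of_file`, and `current_git_sha` are hypothetical helpers. The `git rev-parse` call assumes the script runs inside a git checkout and falls back to `"unknown"` otherwise. How the resulting dict reaches the experiment tracker is left to `data-science-experiment-tracking.md`.

```python
import hashlib
import platform
import subprocess
import sys
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def current_git_sha() -> str:
    """Short commit SHA of HEAD, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def collect_run_metadata(seed: int, lock_path: Path, data_path: Path) -> dict[str, str]:
    """Gather the minimum metadata fields for one run as plain strings."""
    return {
        "git_sha": current_git_sha(),
        "uv_lock": sha256_of_file(lock_path),
        "seed": str(seed),
        "data_hash": sha256_of_file(data_path),
        "python": platform.python_version(),
        "platform": f"{sys.platform}-{platform.machine()}",
    }
```

Returning plain strings keeps the dict trivially serializable, so it can be attached to a tracker run, written next to a report, or printed into a slide footer.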
+### Docker: only at OS boundaries
+Docker solves a real problem: "it works on my Mac but not on the Linux GPU box." It does not solve "I forgot to commit `uv.lock`." Reach for containers when you're genuinely crossing a boundary:
+- Developing on macOS, deploying on Linux — native wheels differ, BLAS differs, occasionally results differ.
+- CUDA version mismatch between dev and prod GPUs.
+- A team standardizing a shared prod environment where `uv sync` isn't enough because the base OS libs drift.
+For a solo greenfield project on one laptop, a Dockerfile is pure overhead. Start with `uv sync --frozen` and add Docker the first time you actually hit a cross-OS reproducibility failure — not before.
+When you do reach for it, keep the image minimal and derived from your lockfile:
+```dockerfile
+FROM python:3.12-slim
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
+WORKDIR /app
+COPY pyproject.toml uv.lock ./
+RUN uv sync --frozen --no-dev
+COPY src/ ./src/
+ENV PYTHONHASHSEED=0
+CMD ["uv", "run", "python", "-m", "src.train"]
+```
+Pin the base image by digest (`python:3.12-slim@sha256:...`) once the project is in prod — floating tags drift and will silently give you a different glibc next month.
+### Reproducibility checklist
+Before calling any analysis "done":
+- `uv.lock` committed and current (`uv sync --frozen` in CI succeeds)
+- `set_seed()` called at the top of every entry point
+- `PYTHONHASHSEED=0` in `.envrc` (and `.envrc` committed, `.env` gitignored)
+- Git SHA + data hash logged with every experiment run
+- Eval suite passes on a clean clone in CI — the real test of reproducibility is a fresh machine, not your own

package/content/knowledge/data-science/data-science-requirements.md ADDED

@@ -0,0 +1,151 @@
+---
+name: data-science-requirements
+description: Problem framing, success metrics, evaluation-test design, stakeholder contracts, and nonfunctional requirements for solo/small-team data science projects
+topics: [data-science, requirements, evaluation, success-metrics, reproducibility]
+---
+As a solo or small-team data scientist without an existing data platform, the single biggest risk to your project is not a bad model — it is ambiguous requirements. Without a tight written spec, a DS project sprawls: the question drifts week to week, the notebook becomes unreproducible, and the stakeholder quietly reinterprets the output. This document defines what "done" looks like for an analytical pipeline, model, or report built from scratch — so you can stop work on time and defend the result.
+## Summary
+A data-science requirements doc states a single well-framed question, one primary success metric with a numeric acceptance threshold declared before any modeling, an evaluation design using held-out data, a stakeholder contract (who consumes the output, in what format, on what cadence), and a nonfunctional budget (reproducibility, runtime, storage). Write the target threshold into a test before you touch training data. If you cannot name the metric and the number, you are not ready to start.
+## Deep Guidance
+### Problem framing
+Most DS projects fail at step one: the question is fuzzy ("understand churn") rather than decidable ("predict 30-day churn for active paying users, with recall >= 0.6 at precision >= 0.3"). The discipline is to force yourself, in writing, to name the decision the output will drive. If you cannot name that decision, stop and interview the stakeholder until you can.
+Use a short, copyable problem-statement block at the top of your project README or PRD. The one below is opinionated — it forces every ambiguous field to get filled in before modeling starts. The tradeoff: for pure exploratory work (e.g. a one-off investigation) this is overkill; a 3-line hypothesis is enough.
+```yaml
+# docs/problem-statement.yaml
+question: >
+  For monthly paying users active in the last 30 days, predict whether they
+  will cancel their subscription within the next 30 days.
+decision_driven:
+  who: Growth team
+  action: Enroll top-decile predicted churners in a retention email campaign
+  cadence: Weekly scoring
+unit_of_analysis: user_id x scoring_date
+prediction_target: churn_within_30d (bool)
+out_of_scope:
+  - free-tier users
+  - annual subscribers
+  - users less than 14 days old at scoring time
+known_confounders:
+  - planned price change on 2026-05-01
+  - seasonality around end-of-year
+```
+### Success metrics
+State the primary success metric and its acceptance threshold in writing before you train anything. The number comes from the stakeholder contract, not from what the model can achieve — otherwise you are reverse-engineering the bar to whatever you got. Pick one primary metric; secondary metrics are tie-breakers, not co-equals.
+Typical patterns:
+- **Predictive model**: one primary metric tied to the downstream decision. For a ranked retention campaign, `recall@top-10%` or `precision@k` beats accuracy or raw AUC, because the campaign can only email the top decile.
+- **Regression / forecast**: `RMSE` in the target's natural unit, plus a naive baseline (last-value, rolling-mean). Beating the baseline is mandatory; if you cannot, the project is not viable.
+- **Analytical pipeline / ETL**: functional correctness plus a p95 runtime budget (e.g. "daily job must finish in < 20 min on the scheduled box").
+- **Report / dashboard**: domain acceptance threshold — the numbers in the report must match an independently computed source-of-truth query within a stated tolerance (e.g. "<= 0.1% deviation from the finance ledger").
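For the ranked-campaign pattern in the first bullet, `recall@top-10%` can be sketched in a few lines. `recall_at_top_fraction` is a hypothetical helper, written in plain Python for clarity. Ties in score are broken by input order, which is an assumption a real implementation may want to revisit.

```python
import math


def recall_at_top_fraction(
    y_true: list[int], y_score: list[float], fraction: float = 0.10
) -> float:
    """Share of all positives that land in the top `fraction` of rows ranked by score."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("recall is undefined without positive labels")
    # Number of rows the campaign can act on; always at least one.
    k = max(1, math.floor(len(y_true) * fraction))
    # sorted() is stable, so equal scores keep their input order.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    captured = sum(label for _, label in ranked[:k])
    return captured / total_positives
```

The metric only looks at the rows the campaign can actually email, which is why it tracks the downstream decision more closely than accuracy or raw AUC.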
+Encode the success metric as a function so it is unambiguous and testable. The expression below is the whole contract — write it the day you start.
+```python
+# src/metrics.py
+from sklearn.metrics import precision_recall_curve
+import numpy as np
+TARGET_RECALL = 0.60
+MIN_PRECISION = 0.30  # at the threshold that achieves TARGET_RECALL
+def primary_metric(y_true: np.ndarray, y_score: np.ndarray) -> dict:
+    """Primary success metric: precision at the threshold that hits target recall."""
+    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
+    # Walk from highest threshold down; stop when recall crosses target.
+    idx = np.searchsorted(recall[::-1], TARGET_RECALL)
+    idx = len(recall) - 1 - idx
+    return {
+        "recall": float(recall[idx]),
+        "precision": float(precision[idx]),
+        "threshold": float(thresholds[min(idx, len(thresholds) - 1)]),
+        "passes": bool(recall[idx] >= TARGET_RECALL and precision[idx] >= MIN_PRECISION),
+    }
+```
+### Evaluation-test design
+The evaluation test is the single gate between "training run" and "ship it." Its job is to answer one question: does the model hit the stated metric on data it has not seen? Get this wrong — leak the future into the past, evaluate on training rows — and every downstream decision is poisoned.
+Opinionated defaults:
+- **Temporal target**: split by time, not randomly. Train on `[t0, t1)`, hold out `[t1, t2)`. Random splits with temporal data leak future information and will silently inflate metrics.
+- **Non-temporal target**: stratified split by the label, fixed `random_state`, held-out fraction 15-20%.
+- **Small data (< 10k rows)**: 5-fold cross-validation with the same fold seed every run; report mean plus std of the primary metric.
+- **Never** tune hyperparameters on the holdout. Use a third validation split or inner CV. Tradeoff: if your dataset is tiny you may have to pool — document the risk explicitly.
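The temporal default can be sketched without any library. `temporal_split` and the `(timestamp, row)` pair layout are assumptions for illustration; with pandas the same idea is a pair of boolean masks on the date column.

```python
from datetime import date
from typing import TypeVar

Row = TypeVar("Row")


def temporal_split(
    rows: list[tuple[date, Row]], t0: date, t1: date, t2: date
) -> tuple[list[Row], list[Row]]:
    """Train on rows dated in [t0, t1), hold out rows dated in [t1, t2).

    Rows outside [t0, t2) are dropped, so nothing later than the holdout
    window can leak into either set.
    """
    if not t0 < t1 < t2:
        raise ValueError("expected t0 < t1 < t2")
    train = [row for ts, row in rows if t0 <= ts < t1]
    holdout = [row for ts, row in rows if t1 <= ts < t2]
    return train, holdout
```

Because the boundaries are half-open, a row dated exactly `t1` lands in the holdout and never in training, so the two sets cannot overlap.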
+The evaluation belongs in the test suite, not a notebook. The stakeholder should be able to run `pytest tests/test_model_evaluation.py` and see green before accepting the deliverable.
+```python
+# tests/test_model_evaluation.py
+import joblib
+import pandas as pd
+import pytest
+from src.metrics import primary_metric, TARGET_RECALL, MIN_PRECISION
+HOLDOUT_PATH = "data/holdout_2026_q1.parquet"
+MODEL_PATH = "artifacts/churn_model.pkl"
+@pytest.fixture(scope="module")
+def scored_holdout():
+    df = pd.read_parquet(HOLDOUT_PATH)
+    model = joblib.load(MODEL_PATH)
+    X = df.drop(columns=["churn_within_30d"])
+    y_true = df["churn_within_30d"].to_numpy()
+    y_score = model.predict_proba(X)[:, 1]
+    return y_true, y_score
+def test_model_beats_acceptance_threshold(scored_holdout):
+    y_true, y_score = scored_holdout
+    result = primary_metric(y_true, y_score)
+    assert result["passes"], (
+        f"Model failed acceptance: recall={result['recall']:.3f} "
+        f"(target {TARGET_RECALL}), precision={result['precision']:.3f} "
+        f"(min {MIN_PRECISION})"
+    )
+def test_model_beats_naive_baseline(scored_holdout):
+    # Baseline: predict global churn rate for everyone. Any real model must beat it.
+    y_true, y_score = scored_holdout
+    baseline_score = pd.Series([y_true.mean()] * len(y_true)).to_numpy()
+    assert primary_metric(y_true, y_score)["precision"] > \
+           primary_metric(y_true, baseline_score)["precision"]
+```
+### Stakeholder contract
+A stakeholder contract makes the hand-off concrete. Without it, you deliver a notebook and the recipient quietly asks for a PDF, a Slack message, a dashboard, or a CSV — all different artifacts. Write this down the same week you write the problem statement.
+Minimum fields, in order of how often they get skipped:
+- **Consumer**: named human or team, not "the business."
+- **Artifact format**: one of `csv`, `parquet`, `dashboard (URL)`, `API endpoint`, `PDF report`, `Slack summary`. Pick exactly one primary.
+- **Schema**: column names, types, units, PII flags. Include an example row.
+- **Cadence**: one-shot, daily, weekly, on-demand. If recurring, name the day-of-week and time-of-day.
+- **Freshness SLA**: how stale is the underlying data allowed to be at delivery time.
+- **Failure behavior**: what happens if the pipeline fails — silent retry, page the owner, stale-serve, fail loud.
+- **Sunset criteria**: when does this deliverable stop being needed. If you cannot answer, the project has no natural end.
+A one-off analysis can collapse this into a single paragraph; a recurring pipeline needs all seven fields in a short `CONTRACT.md` alongside the code.
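One way to keep the seven fields from being skipped is to hold them in a small typed record that reports what is missing. The sketch below is an illustration only: the class name `StakeholderContract`, the `problems` method, and the example values are hypothetical, and the allowed formats and cadences are copied from the list above.

```python
from dataclasses import dataclass, fields

ARTIFACT_FORMATS = {
    "csv", "parquet", "dashboard (URL)", "API endpoint", "PDF report", "Slack summary",
}
CADENCES = {"one-shot", "daily", "weekly", "on-demand"}


@dataclass(frozen=True)
class StakeholderContract:
    consumer: str
    artifact_format: str
    schema: str
    cadence: str  # e.g. "weekly, Monday 09:00"; the part before the comma is checked
    freshness_sla: str
    failure_behavior: str
    sunset_criteria: str

    def problems(self) -> list:
        """Return human-readable problems; an empty list means the contract is complete."""
        found = [
            f"{f.name} is empty"
            for f in fields(self)
            if not getattr(self, f.name).strip()
        ]
        if self.artifact_format not in ARTIFACT_FORMATS:
            found.append(f"artifact_format must be one of {sorted(ARTIFACT_FORMATS)}")
        if self.cadence.split(",")[0].strip() not in CADENCES:
            found.append(f"cadence must start with one of {sorted(CADENCES)}")
        if self.consumer.strip().lower() == "the business":
            found.append("consumer must be a named human or team")
        return found


# Hypothetical example values; sunset criteria deliberately left blank.
contract = StakeholderContract(
    consumer="Retention team",
    artifact_format="parquet",
    schema="customer_id: str, churn_score: float in [0, 1], no PII",
    cadence="weekly, Monday 09:00",
    freshness_sla="source data at most 24 hours old at delivery",
    failure_behavior="page the owner",
    sunset_criteria="",
)
print(contract.problems())  # ['sunset_criteria is empty']
```

A check like this could run in CI against the values recorded in `CONTRACT.md`, although how those values are parsed from the file is left open here.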
+### Nonfunctional requirements
+Nonfunctional requirements are what separates a notebook from a deliverable. Three to name explicitly:
+- **Reproducibility**: the pipeline must produce byte-identical outputs given identical inputs. That means a pinned `requirements.txt` (or `pyproject.toml` + lockfile), explicit `random_state` on every stochastic step (train/test split, model init, shuffling, samplers), a recorded data snapshot (immutable parquet under a dated path, not a mutable SQL query), and an entry-point script that runs end-to-end without manual cells. Test it: delete your local `.venv`, re-clone, run the script, diff the outputs. If they differ, reproducibility is broken. The tradeoff: strict byte-reproducibility is hard on GPU — for deep-learning projects, accept statistical reproducibility (metric within a tolerance) and document the exact hardware/CUDA version.
+- **Runtime budget**: name a wall-clock ceiling for the full pipeline on the hardware you actually have. A useful default for small-team work: "end-to-end run (data pull -> train -> evaluate -> scoring output) must complete in <= 1 hour on a 16GB MacBook Pro." If you blow past it, either simplify or move to a bigger box deliberately — do not let runtime creep silently.
+- **Storage budget**: cap the on-disk footprint of raw data, features, and model artifacts. For laptop-scale work, `< 20 GB` total is a reasonable starting point; over that, you need a deliberate story (external object store, partitioned pulls, sampling). Record the budget in the README and check it in CI with a simple `du -sh` assertion.
+Encode these as top-of-project invariants, not aspirations. If the model hits the success metric but the pipeline is unreproducible or blows the runtime budget, the project is not done.
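The reproducibility diff and the storage cap can both be expressed as small checks. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the size it reports is the sum of file sizes, which can differ slightly from what `du -sh` prints.

```python
import hashlib
from pathlib import Path

STORAGE_BUDGET_BYTES = 20 * 1024**3  # the "< 20 GB" starting point from the text


def dir_size_bytes(root):
    """Sum of the sizes of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def tree_digest(root):
    """One SHA-256 over relative paths and file contents, for comparing two runs.

    Reads each file fully into memory, so it suits small outputs, not raw data.
    """
    h = hashlib.sha256()
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            h.update(p.relative_to(root).as_posix().encode())
            h.update(p.read_bytes())
    return h.hexdigest()


def check_storage_budget(root, budget_bytes=STORAGE_BUDGET_BYTES):
    """Raise AssertionError if root is at or over budget; return bytes used."""
    used = dir_size_bytes(root)
    assert used < budget_bytes, f"{root} uses {used} bytes, budget is {budget_bytes}"
    return used
```

Running the pipeline twice into two output directories and comparing `tree_digest` for each approximates the "re-clone, run, diff" test. For GPU work where only statistical reproducibility is expected, a digest comparison would be too strict and a metric tolerance check is the closer fit.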
+Taken together, these five sections — problem framing, success metric, evaluation test, stakeholder contract, and nonfunctional budget — form the acceptance spec for the project. Write them up front, commit them alongside the code, and treat any drift as a scope change that requires re-agreeing with the stakeholder.

package/content/knowledge/data-science/data-science-security.md ADDED

@@ -0,0 +1,151 @@
+---
+name: data-science-security
+description: Practical security guardrails for solo / small-team data-science work — PII masking at ingest, credential hygiene with direnv and 1Password, data classification tiers, notebook output stripping, and a note on model memorization
+topics: [data-science, security, pii, secrets, data-classification]
+---
+DS work has elevated security risk because analysis code routinely touches raw customer data before anyone has had a chance to sanitize it. A notebook can render real names, emails, and account numbers inline, then get committed to git, emailed to a stakeholder, or pasted into Slack without a second thought. Prediction caches and CSV exports quietly duplicate sensitive rows into `data/` subdirectories. Credentials for warehouses and cloud buckets get dropped into `.env` files or — worse — directly into a notebook cell. The blast radius of a sloppy DS workflow is larger than people assume, and the mitigations are not exotic: they are cheap, boring habits that need to be enforced by tooling.
+## Summary
+Mask `PII` at the ingest boundary so downstream notebooks and logs never see raw identifiers — hash emails, truncate names, drop free-text you do not need. Never commit `secrets`; keep local credentials in a gitignored `direnv` `.envrc.local` or, better, inject them at runtime with `1Password` CLI (`op run --`) so they are never written to disk. Classify every dataset as public / internal / confidential / restricted and let the tier decide where it lives — restricted data stays in the warehouse, confidential gets gitignored, internal lives on a shared drive, public is public. Strip notebook outputs with `nbstripout` as a pre-commit hook (or switch to Marimo's `.py` notebooks, which do not embed outputs at all). For fine-tuned or RAG models, assume training data can leak back out through generations and scrub accordingly.
+## Deep Guidance
+### Handling PII
+Identify `PII` at the ingest boundary, not inside your analysis code. The rule is: once a column has left the ingest layer, it should either be pseudonymized (hashed, truncated, bucketed) or stripped. Free-text fields (support tickets, chat logs, notes) are the worst offenders — if the analysis does not require them, drop them. If it does, run them through a scrubber like Presidio or a simple regex pass before they land in a DataFrame.
+Typical categories to handle:
+- **Direct identifiers** — name, email, phone, SSN, account number, precise address. Hash or drop.
+- **Quasi-identifiers** — fields that are harmless alone but identifying in combination: ZIP + age + gender can narrow a record down to a surprisingly small group of people, sometimes a single individual. Bucket aggressively (age → 10-year bands, ZIP → first 3 digits).
+- **Sensitive attributes** — health, financial, biometric. Treat as restricted (see classification below) and keep out of local files entirely.
+- **Free-text** — run through a scrubber or drop unless the analysis genuinely needs the prose.
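For the "simple regex pass" option on free text, a rough sketch might look like this. The patterns are deliberately simple and US-centric, they will miss names and many identifier formats, and the function name `scrub_text` is hypothetical. A dedicated scrubber such as Presidio is the more thorough route.

```python
import re

# Simple illustrative patterns: they catch common formats, not every variant.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
_PHONE = re.compile(
    r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
)


def scrub_text(text):
    """Replace regex-detectable identifiers with placeholder tokens."""
    text = _EMAIL.sub("[EMAIL]", text)
    text = _SSN.sub("[SSN]", text)
    text = _PHONE.sub("[PHONE]", text)
    return text


print(scrub_text("Contact jane.doe@example.com or 555-123-4567, SSN 123-45-6789."))
# Contact [EMAIL] or [PHONE], SSN [SSN].
```

Applied at ingest, for example with `df["notes"].map(scrub_text)` on a hypothetical `notes` column, this keeps the obvious identifiers out of downstream DataFrames while leaving the prose usable.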
+A minimal masking helper for structured data:
+```python
+# src/pii.py
+import hashlib
+import pandas as pd
+def _hash_email(email: str, salt: str) -> str:
+    """Deterministic, salted hash — same email maps to same token for joins."""
+    if pd.isna(email):
+        return ""
+    return hashlib.sha256(f"{salt}:{email.lower().strip()}".encode()).hexdigest()[:16]
+def mask_customer_frame(df: pd.DataFrame, salt: str) -> pd.DataFrame:
+    out = df.copy()
+    if "email" in out:
+        out["email_id"] = out["email"].map(lambda e: _hash_email(e, salt))
+        out = out.drop(columns=["email"])
+    if "full_name" in out:
+        # keep first initial for rough demographic analysis, drop the rest
+        out["name_initial"] = out["full_name"].str[:1]
+        out = out.drop(columns=["full_name"])
+    # drop anything we never need
+    for col in ("phone", "ssn", "address", "dob"):
+        if col in out:
+            out = out.drop(columns=[col])
+    return out
+```
+Pair this with a `pandera` schema check on the training-ready DataFrame that asserts sensitive columns are absent — "no bare `email` column, no `ssn` column, no `phone` column." That way a future change that accidentally reintroduces raw PII fails loudly in CI instead of silently:
+```python
+import pandera.pandas as pa
+TrainingSchema = pa.DataFrameSchema(
+    columns={
+        "email_id": pa.Column(str),
+        "name_initial": pa.Column(str, nullable=True),
+        "signup_month": pa.Column("datetime64[ns]"),
+    },
+    strict=True,  # reject any column not listed
+)
+# extra defensive: denylist raw-PII names in case strict=True is relaxed later
+_FORBIDDEN = {"email", "full_name", "phone", "ssn", "address", "dob"}
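+def assert_no_raw_pii(df):
+    """Fail loudly on raw PII columns, then validate df against TrainingSchema."""
+    leaked = _FORBIDDEN & set(df.columns)
+    assert not leaked, f"raw PII leaked: {sorted(leaked)}"
+    return TrainingSchema.validate(df)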
+```
+Run this check at the boundary between ingest and modeling, and again before anything gets written to a prediction cache or exported as a report.
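The masking helper above covers direct identifiers. Quasi-identifiers can be coarsened in the same spirit. The sketch below assumes hypothetical `age` and `zip` columns and applies the 10-year bands and 3-digit ZIP prefixes mentioned earlier; the function name is illustrative only.

```python
import pandas as pd


def bucket_quasi_identifiers(df):
    """Coarsen age into 10-year bands and ZIP codes to their first 3 digits."""
    out = df.copy()
    if "age" in out:
        # 34 -> 30, 67 -> 60; Int64 keeps missing ages as <NA> instead of failing
        decade = (out["age"] // 10 * 10).astype("Int64")
        out["age_band"] = (
            decade.astype("string") + "-" + (decade + 9).astype("string")
        )
        out = out.drop(columns=["age"])
    if "zip" in out:
        # zfill restores leading zeros lost if ZIPs were stored as integers
        out["zip3"] = out["zip"].astype("string").str.zfill(5).str[:3]
        out = out.drop(columns=["zip"])
    return out


customers = pd.DataFrame({
    "age": [34, 40, 67],
    "zip": ["94110", "02139", "10001"],
    "plan": ["basic", "pro", "basic"],
})
bucketed = bucket_quasi_identifiers(customers)
```

Whether these bands are coarse enough depends on the population size. For small cohorts, wider bands or dropping the column entirely may be the safer choice.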
+### Credential hygiene
+Never commit `secrets`. There are two patterns worth using locally; pick one per project and be consistent.
+**Pattern 1 — `direnv` with a gitignored `.envrc.local`:**
+```bash
+# .envrc (committed — references local overrides)
+dotenv_if_exists .envrc.local
+# .envrc.local (gitignored — real values live here)
+export WAREHOUSE_URL="postgres://analytics:REAL_PASSWORD@warehouse.internal/prod"
+export AWS_PROFILE="ds-read"
+```
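+Add `.envrc.local` and `.env*` to `.gitignore`, followed by a `!.envrc` negation (and `!.env.1password` if you use Pattern 2) so the `.env*` glob does not also ignore the files that are meant to be committed. `direnv` loads these exports automatically when you `cd` into the project, once you have approved the file with `direnv allow`.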
+**Pattern 2 — `1Password` CLI with `op run`:**
+```bash
+# .env.1password (committed — references, not values)
+WAREHOUSE_URL=op://DS/warehouse-prod/connection_url
+OPENAI_API_KEY=op://DS/openai/api_key
+# run any command with secrets injected at runtime
+op run --env-file=.env.1password -- python src/train.py
+op run --env-file=.env.1password -- jupyter lab
+```
+`op run` substitutes the `op://` references with real values in the child process's environment and never writes them to disk. The committed `.env.1password` file is safe to share because it contains only vault paths, not secrets. This is the stronger pattern when more than one person needs access — you manage grants in 1Password instead of passing `.envrc.local` files around.
+In production, secrets live in the platform's secret manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) and get injected into the runtime the same way. The governing rule: **if it would go in a `.env` file, it goes in 1Password; if it would go in a secret manager in prod, it stays there — don't duplicate a copy onto your laptop.**
+A few hygiene rules that follow from this:
+- Never paste an API key into a notebook cell, even temporarily. Cells get autosaved, checkpointed, and sometimes committed.
+- Never print a credential to logs — wrap secret-carrying objects in types that redact on `__repr__` (Pydantic's `SecretStr`, for example).
+- Rotate any credential that has ever touched your clipboard, a chat window, or a screen share.
+- Run a pre-commit scanner (`gitleaks` or `detect-secrets`) so a stray key cannot get committed even when the `.envrc.local` pattern is ignored.
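To illustrate the redact-on-`__repr__` idea without pulling in a dependency, here is a minimal stand-in. The `Secret` class and `secret_from_env` helper are hypothetical names for this sketch; in a project that already uses Pydantic, its `SecretStr` type fills the same role.

```python
import os


class Secret:
    """Holds a sensitive string and redacts it in repr(), str(), and f-strings."""

    __slots__ = ("_value",)

    def __init__(self, value):
        self._value = value

    def reveal(self):
        """Return the real value; call only at the point of use."""
        return self._value

    def __repr__(self):
        return "Secret('**********')"

    __str__ = __repr__


def secret_from_env(name):
    """Read a required secret from the environment, e.g. one injected by direnv or `op run`."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return Secret(value)
```

With this wrapper, an accidental `print(config)` or a logged exception shows the placeholder rather than the credential, and the only way to get the raw value is an explicit `reveal()` call that is easy to grep for in review.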
+### Data classification
+Classify every dataset against a four-tier rubric and let the tier drive storage and access:
+- **Public** — already on the internet (open datasets, published benchmarks). Can live anywhere, including git.
+- **Internal** — non-sensitive company data (aggregated metrics, anonymized cohorts). Shared private drive or object store with team-level access. Do not commit to git.
+- **Confidential** — business-sensitive but not regulated (revenue breakdowns, customer segments, unreleased product data). Gitignored `data/` directory locally; encrypted bucket with narrow ACL for sharing. Never in notebooks you paste into Slack.
+- **Restricted** — regulated or high-risk PII (health records, payment data, government IDs, raw customer identifiers). Stays in the warehouse or source bucket — **do not download**. Run analysis server-side (dbt model, warehouse notebook, SQL-only pipeline) and only materialize aggregates locally.
+The mapping matters more than the labels. The point of classification is that "can I keep a CSV of this on my laptop?" has a predetermined answer instead of a per-dataset judgment call made while tired.
+Record the classification alongside the data — a one-line `data/README.md` entry per source (`customers_raw: restricted, warehouse-only`) is enough. When a new teammate or a future-you adds a pull, the constraint is visible without having to ask.
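If the `data/README.md` entries follow the one-line shape shown above, they can be parsed and checked mechanically. This is a sketch under that assumption; the `Tier` enum and helper names are hypothetical, and the two policy functions encode only what the rubric states (git is for public data, restricted data is never downloaded).

```python
from enum import Enum


class Tier(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


def may_commit_to_git(tier):
    """Only public data may live in git."""
    return tier is Tier.PUBLIC


def may_download(tier):
    """Restricted data stays in the warehouse.

    Assumption: internal data may be held locally; the rubric does not forbid it.
    """
    return tier is not Tier.RESTRICTED


def parse_entry(line):
    """Parse 'customers_raw: restricted, warehouse-only' into (source, Tier, note)."""
    source, _, rest = line.strip().lstrip("-* ").partition(":")
    tier_text, _, note = rest.partition(",")
    # Tier(...) raises ValueError for an unknown tier, which is the desired failure
    return source.strip("` "), Tier(tier_text.strip().lower()), note.strip()


source, tier, note = parse_entry("customers_raw: restricted, warehouse-only")
```

A data-pull script could call `may_download(tier)` before writing anything under `data/`, so the constraint is enforced by code as well as documented.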
+### Notebook output hygiene
+A Jupyter `.ipynb` file is a JSON blob that embeds every cell's rendered output, which means a single `df.head()` on a customer table commits 5 real customer rows to git forever. Strip outputs with `nbstripout` as a pre-commit hook:
+```yaml
+# .pre-commit-config.yaml
+repos:
+  - repo: https://github.com/kynan/nbstripout
+    rev: 0.7.1
+    hooks:
+      - id: nbstripout
+        files: \.ipynb$
+```
+Install once with `pre-commit install` and every `git commit` scrubs outputs automatically. For belt-and-braces, pair it with a pre-save hook in your Jupyter config (`jupyter_notebook_config.py`, via `FileContentsManager.pre_save_hook`) that clears outputs before the notebook is written to disk.
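A hook only protects machines where it is installed, so a CI-side check that notebooks really are stripped can be a useful backstop. The sketch below reads the notebook JSON directly with the standard library, assuming the nbformat 4 layout with a top-level `cells` list; the function name is hypothetical.

```python
import json
from pathlib import Path


def cells_with_outputs(notebook_path):
    """Return indexes of code cells that still carry outputs or execution counts."""
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    dirty = []
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        if cell.get("outputs") or cell.get("execution_count") is not None:
            dirty.append(i)
    return dirty
```

A test can loop over every `.ipynb` file in the repository and assert that this returns an empty list, which fails the build if someone commits with the hook bypassed.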
+Marimo's `.py`-format notebooks sidestep this problem — they are regular Python files, outputs never get persisted in the notebook, and diffs are reviewable like ordinary code. If you have not picked a notebook format for a new project, prefer Marimo; see `data-science-notebook-discipline` for the broader tradeoffs.
+Whichever format you pick, also keep prediction caches, CSV exports, and ad-hoc scratch files out of git — a broad `data/` and `outputs/` entry in `.gitignore` prevents the most common leak: a confidential sample dataset getting committed as an "example."
+### A word on model memorization
+Fine-tuned LLMs and RAG systems can reproduce training data verbatim under the right prompt. If your fine-tune corpus or retrieval index contains PII, assume it can leak. Mitigations, in order of strength: scrub PII from the corpus before training or indexing (reuse the masking helper above); host the model privately so prompts and responses stay inside your perimeter; apply output filtering to block regex-detectable identifiers on the way out. Do not fine-tune a public base model on raw customer data and then expose it on an open endpoint — that is the failure mode worth avoiding.