npm - @zigrivers/scaffold - Versions diffs - 3.22.0 → 3.24.0 - Mend

@zigrivers/scaffold 3.22.0 → 3.24.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (111)

package/README.md +44 -23
package/content/knowledge/core/automated-review-tooling.md +3 -3
package/content/knowledge/core/multi-model-review-dispatch.md +13 -4
package/content/knowledge/data-science/README.md +23 -0
package/content/knowledge/data-science/data-science-architecture.md +163 -0
package/content/knowledge/data-science/data-science-conventions.md +233 -0
package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
package/content/knowledge/data-science/data-science-observability.md +161 -0
package/content/knowledge/data-science/data-science-project-structure.md +178 -0
package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
package/content/knowledge/data-science/data-science-requirements.md +151 -0
package/content/knowledge/data-science/data-science-security.md +151 -0
package/content/knowledge/data-science/data-science-testing.md +183 -0
package/content/knowledge/ml/README.md +10 -0
package/content/methodology/data-science-overlay.yml +39 -0
package/content/pipeline/build/multi-agent-resume.md +7 -6
package/content/pipeline/build/multi-agent-start.md +7 -6
package/content/pipeline/build/single-agent-resume.md +7 -6
package/content/pipeline/build/single-agent-start.md +7 -6
package/content/pipeline/environment/automated-pr-review.md +79 -27
package/content/skills/mmr/SKILL.md +72 -2
package/content/skills/scaffold-runner/SKILL.md +65 -19
package/content/tools/review-code.md +74 -16
package/content/tools/review-pr.md +25 -6
package/dist/cli/commands/check.d.ts.map +1 -1
package/dist/cli/commands/check.js +28 -17
package/dist/cli/commands/check.js.map +1 -1
package/dist/config/schema.d.ts +672 -126
package/dist/config/schema.d.ts.map +1 -1
package/dist/config/schema.js +8 -0
package/dist/config/schema.js.map +1 -1
package/dist/config/schema.test.js +2 -2
package/dist/config/schema.test.js.map +1 -1
package/dist/config/validators/data-science.d.ts +4 -0
package/dist/config/validators/data-science.d.ts.map +1 -0
package/dist/config/validators/data-science.js +15 -0
package/dist/config/validators/data-science.js.map +1 -0
package/dist/config/validators/index.d.ts.map +1 -1
package/dist/config/validators/index.js +2 -0
package/dist/config/validators/index.js.map +1 -1
package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
package/dist/core/assembly/knowledge-loader.js +6 -0
package/dist/core/assembly/knowledge-loader.js.map +1 -1
package/dist/core/assembly/knowledge-loader.test.js +34 -0
package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
package/dist/e2e/project-type-overlays.test.js +73 -0
package/dist/e2e/project-type-overlays.test.js.map +1 -1
package/dist/project/adopt.d.ts.map +1 -1
package/dist/project/adopt.js +3 -1
package/dist/project/adopt.js.map +1 -1
package/dist/project/detectors/coverage.test.d.ts +2 -0
package/dist/project/detectors/coverage.test.d.ts.map +1 -0
package/dist/project/detectors/coverage.test.js +78 -0
package/dist/project/detectors/coverage.test.js.map +1 -0
package/dist/project/detectors/data-science.d.ts +4 -0
package/dist/project/detectors/data-science.d.ts.map +1 -0
package/dist/project/detectors/data-science.js +32 -0
package/dist/project/detectors/data-science.js.map +1 -0
package/dist/project/detectors/data-science.test.d.ts +2 -0
package/dist/project/detectors/data-science.test.d.ts.map +1 -0
package/dist/project/detectors/data-science.test.js +62 -0
package/dist/project/detectors/data-science.test.js.map +1 -0
package/dist/project/detectors/disambiguate.d.ts +2 -0
package/dist/project/detectors/disambiguate.d.ts.map +1 -1
package/dist/project/detectors/disambiguate.js +3 -2
package/dist/project/detectors/disambiguate.js.map +1 -1
package/dist/project/detectors/disambiguate.test.js +10 -1
package/dist/project/detectors/disambiguate.test.js.map +1 -1
package/dist/project/detectors/index.d.ts.map +1 -1
package/dist/project/detectors/index.js +2 -0
package/dist/project/detectors/index.js.map +1 -1
package/dist/project/detectors/library.d.ts.map +1 -1
package/dist/project/detectors/library.js +1 -0
package/dist/project/detectors/library.js.map +1 -1
package/dist/project/detectors/resolve-detection.test.js +31 -0
package/dist/project/detectors/resolve-detection.test.js.map +1 -1
package/dist/project/detectors/types.d.ts +6 -2
package/dist/project/detectors/types.d.ts.map +1 -1
package/dist/project/detectors/types.js.map +1 -1
package/dist/types/config.d.ts +8 -1
package/dist/types/config.d.ts.map +1 -1
package/dist/wizard/copy/core.d.ts.map +1 -1
package/dist/wizard/copy/core.js +4 -0
package/dist/wizard/copy/core.js.map +1 -1
package/dist/wizard/copy/data-science.d.ts +3 -0
package/dist/wizard/copy/data-science.d.ts.map +1 -0
package/dist/wizard/copy/data-science.js +15 -0
package/dist/wizard/copy/data-science.js.map +1 -0
package/dist/wizard/copy/index.d.ts.map +1 -1
package/dist/wizard/copy/index.js +2 -0
package/dist/wizard/copy/index.js.map +1 -1
package/dist/wizard/copy/types.d.ts +5 -1
package/dist/wizard/copy/types.d.ts.map +1 -1
package/dist/wizard/copy/types.test-d.js +7 -0
package/dist/wizard/copy/types.test-d.js.map +1 -1
package/dist/wizard/questions.d.ts +2 -1
package/dist/wizard/questions.d.ts.map +1 -1
package/dist/wizard/questions.js +9 -1
package/dist/wizard/questions.js.map +1 -1
package/dist/wizard/questions.test.js +14 -0
package/dist/wizard/questions.test.js.map +1 -1
package/dist/wizard/wizard.d.ts.map +1 -1
package/dist/wizard/wizard.js +1 -0
package/dist/wizard/wizard.js.map +1 -1
package/package.json +1 -1
package/skills/mmr/SKILL.md +72 -2
package/skills/scaffold-runner/SKILL.md +65 -19

package/content/knowledge/data-science/data-science-security.md ADDED

@@ -0,0 +1,151 @@
+---
+name: data-science-security
+description: Practical security guardrails for solo / small-team data-science work — PII masking at ingest, credential hygiene with direnv and 1Password, data classification tiers, notebook output stripping, and a note on model memorization
+topics: [data-science, security, pii, secrets, data-classification]
+---
+DS work has elevated security risk because analysis code routinely touches raw customer data before anyone has had a chance to sanitize it. A notebook can render real names, emails, and account numbers inline, then get committed to git, emailed to a stakeholder, or pasted into Slack without a second thought. Prediction caches and CSV exports quietly duplicate sensitive rows into `data/` subdirectories. Credentials for warehouses and cloud buckets get dropped into `.env` files or — worse — directly into a notebook cell. The blast radius of a sloppy DS workflow is larger than people assume, and the mitigations are not exotic: they are cheap, boring habits that need to be enforced by tooling.
+## Summary
+Mask `PII` at the ingest boundary so downstream notebooks and logs never see raw identifiers — hash emails, truncate names, drop free-text you do not need. Never commit `secrets`; keep local credentials in a gitignored `direnv` `.envrc.local` or, better, inject them at runtime with `1Password` CLI (`op run --`) so they are never written to disk. Classify every dataset as public / internal / confidential / restricted and let the tier decide where it lives — restricted data stays in the warehouse, confidential gets gitignored, internal lives on a shared drive, public is public. Strip notebook outputs with `nbstripout` as a pre-commit hook (or switch to Marimo's `.py` notebooks, which do not embed outputs at all). For fine-tuned or RAG models, assume training data can leak back out through generations and scrub accordingly.
+## Deep Guidance
+### Handling PII
+Identify `PII` at the ingest boundary, not inside your analysis code. The rule is: once a column has left the ingest layer, it should either be pseudonymized (hashed, truncated, bucketed) or stripped. Free-text fields (support tickets, chat logs, notes) are the worst offenders — if the analysis does not require them, drop them. If it does, run them through a scrubber like Presidio or a simple regex pass before they land in a DataFrame.
+Typical categories to handle:
+- **Direct identifiers** — name, email, phone, SSN, account number, precise address. Hash or drop.
+- **Quasi-identifiers** — ZIP + age + gender can, in combination, single out one individual even when no field is identifying on its own. Bucket aggressively (age → 10-year bands, ZIP → first 3 digits).
+- **Sensitive attributes** — health, financial, biometric. Treat as restricted (see classification below) and keep out of local files entirely.
+- **Free-text** — run through a scrubber or drop unless the analysis genuinely needs the prose.
+A minimal masking helper for structured data:
+```python
+# src/pii.py
+import hashlib
+import pandas as pd
+def _hash_email(email: str, salt: str) -> str:
+    """Deterministic, salted hash — same email maps to same token for joins."""
+    if pd.isna(email):
+        return ""
+    return hashlib.sha256(f"{salt}:{email.lower().strip()}".encode()).hexdigest()[:16]
+def mask_customer_frame(df: pd.DataFrame, salt: str) -> pd.DataFrame:
+    out = df.copy()
+    if "email" in out:
+        out["email_id"] = out["email"].map(lambda e: _hash_email(e, salt))
+        out = out.drop(columns=["email"])
+    if "full_name" in out:
+        # keep first initial for rough demographic analysis, drop the rest
+        out["name_initial"] = out["full_name"].str[:1]
+        out = out.drop(columns=["full_name"])
+    # drop anything we never need
+    for col in ("phone", "ssn", "address", "dob"):
+        if col in out:
+            out = out.drop(columns=[col])
+    return out
+```
+Pair this with a `pandera` schema check on the training-ready DataFrame that asserts sensitive columns are absent — "no bare `email` column, no `ssn` column, no `phone` column." That way a future change that accidentally reintroduces raw PII fails loudly in CI instead of silently:
+```python
+import pandera.pandas as pa
+TrainingSchema = pa.DataFrameSchema(
+    columns={
+        "email_id": pa.Column(str),
+        "name_initial": pa.Column(str, nullable=True),
+        "signup_month": pa.Column("datetime64[ns]"),
+    },
+    strict=True,  # reject any column not listed
+)
+# extra defensive: blacklist raw-PII names in case strict=True is relaxed later
+_FORBIDDEN = {"email", "full_name", "phone", "ssn", "address", "dob"}
+assert not (_FORBIDDEN & set(df.columns)), f"raw PII leaked: {_FORBIDDEN & set(df.columns)}"
+```
+Run this check at the boundary between ingest and modeling, and again before anything gets written to a prediction cache or exported as a report.
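The quasi-identifier bucketing described in the category list (age into 10-year bands, ZIP to its first 3 digits) could be sketched in plain Python as follows. The helper names `bucket_age` and `truncate_zip` are hypothetical and not part of `src/pii.py`; treat this as an illustration under those assumptions, not a vetted anonymization routine.

```python
from typing import Optional


def bucket_age(age: Optional[float]) -> str:
    """Map an exact age to a 10-year band such as '30-39' (hypothetical helper)."""
    # `age != age` is True only for NaN, so missing values never get a band.
    if age is None or age != age or age < 0:
        return "unknown"
    low = int(age) // 10 * 10
    return f"{low}-{low + 9}"


def truncate_zip(zip_code: Optional[str]) -> str:
    """Keep only the first 3 digits of a ZIP code (hypothetical helper).

    Assumes ZIPs are read as strings so leading zeros are preserved.
    """
    digits = "".join(ch for ch in str(zip_code or "") if ch.isdigit())
    return digits[:3] if len(digits) >= 3 else "unknown"
```

Applied column-wise (for example with `Series.map`), these would replace the exact `age` and `zip` values before the frame leaves the ingest layer.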
+### Credential hygiene
+Never commit `secrets`. There are two patterns worth using locally; pick one per project and be consistent.
+**Pattern 1 — `direnv` with a gitignored `.envrc.local`:**
+```bash
+# .envrc (committed — references local overrides)
+dotenv_if_exists .envrc.local
+# .envrc.local (gitignored — real values live here)
+export WAREHOUSE_URL="postgres://analytics:REAL_PASSWORD@warehouse.internal/prod"
+export AWS_PROFILE="ds-read"
+```
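+Add `.envrc.local` and `.env*` to `.gitignore`, followed by `!.envrc` (and `!.env.1password` if you use Pattern 2), because the `.env*` glob would otherwise also ignore those committed files. `direnv` loads these exports automatically when you `cd` into the project, once the `.envrc` has been approved with `direnv allow`.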
+**Pattern 2 — `1Password` CLI with `op run`:**
+```bash
+# .env.1password (committed — references, not values)
+WAREHOUSE_URL=op://DS/warehouse-prod/connection_url
+OPENAI_API_KEY=op://DS/openai/api_key
+# run any command with secrets injected at runtime
+op run --env-file=.env.1password -- python src/train.py
+op run --env-file=.env.1password -- jupyter lab
+```
+`op run` substitutes the `op://` references with real values in the child process's environment and never writes them to disk. The committed `.env.1password` file is safe to share because it contains only vault paths, not secrets. This is the stronger pattern when more than one person needs access — you manage grants in 1Password instead of passing `.envrc.local` files around.
+In production, secrets live in the platform's secret manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) and get injected into the runtime the same way. The governing rule: **if it would go in a `.env` file, it goes in 1Password; if it would go in a secret manager in prod, it stays there — don't duplicate a copy onto your laptop.**
+A few hygiene rules that follow from this:
+- Never paste an API key into a notebook cell, even temporarily. Cells get autosaved, checkpointed, and sometimes committed.
+- Never print a credential to logs — wrap secret-carrying objects in types that redact on `__repr__` (Pydantic's `SecretStr`, for example).
+- Rotate any credential that has ever touched your clipboard, a chat window, or a screen share.
+- Run a pre-commit scanner (`gitleaks` or `detect-secrets`) so a stray key cannot get committed even when the `.envrc.local` pattern is ignored.
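The "redact on `__repr__`" rule can be illustrated without any dependency. The `Redacted` class below is a hypothetical stand-in that shows the idea behind types like Pydantic's `SecretStr`; it is a sketch, not a replacement for them.

```python
class Redacted:
    """Hypothetical wrapper that hides a secret from repr(), str(), and f-strings."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        self._value = value

    def reveal(self) -> str:
        """Return the raw secret; call this only where the value is consumed."""
        return self._value

    def __repr__(self) -> str:
        return "Redacted('********')"

    # str() and f-string formatting fall back to the same masked text.
    __str__ = __repr__


token = Redacted("sk-live-123")
log_line = f"connecting with {token}"  # the raw value never reaches the log line
```

The design choice is that leaking requires an explicit `reveal()` call, so an accidental `print` or logger call shows only the mask.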
+### Data classification
+Classify every dataset against a four-tier rubric and let the tier drive storage and access:
+- **Public** — already on the internet (open datasets, published benchmarks). Can live anywhere, including git.
+- **Internal** — non-sensitive company data (aggregated metrics, anonymized cohorts). Shared private drive or object store with team-level access. Do not commit to git.
+- **Confidential** — business-sensitive but not regulated (revenue breakdowns, customer segments, unreleased product data). Gitignored `data/` directory locally; encrypted bucket with narrow ACL for sharing. Never in notebooks you paste into Slack.
+- **Restricted** — regulated or high-risk PII (health records, payment data, government IDs, raw customer identifiers). Stays in the warehouse or source bucket — **do not download**. Run analysis server-side (dbt model, warehouse notebook, SQL-only pipeline) and only materialize aggregates locally.
+The mapping matters more than the labels. The point of classification is that "can I keep a CSV of this on my laptop?" has a predetermined answer instead of a per-dataset judgment call made while tired.
+Record the classification alongside the data — a one-line `data/README.md` entry per source (`customers_raw: restricted, warehouse-only`) is enough. When a new teammate or a future-you adds a pull, the constraint is visible without having to ask.
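One way to make the tier-to-storage mapping machine-checkable is a small lookup, sketched below. The names `Tier`, `STORAGE_RULE`, `may_download_raw`, and `may_commit_to_git` are hypothetical, and the two predicates are one reading of the rubric above rather than an official policy.

```python
from enum import Enum


class Tier(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


# Storage rule per tier, paraphrasing the four-tier rubric.
STORAGE_RULE = {
    Tier.PUBLIC: "anywhere, including git",
    Tier.INTERNAL: "shared private drive or object store; not in git",
    Tier.CONFIDENTIAL: "gitignored data/ locally; encrypted bucket for sharing",
    Tier.RESTRICTED: "warehouse or source bucket only; local aggregates only",
}


def may_download_raw(tier: Tier) -> bool:
    """Assumption: raw rows may be copied locally for every tier except restricted."""
    return tier is not Tier.RESTRICTED


def may_commit_to_git(tier: Tier) -> bool:
    """Only public data may be committed."""
    return tier is Tier.PUBLIC
```

A loader could parse the tier from the `data/README.md` entry with `Tier("restricted")` and refuse to write a local file when `may_download_raw` returns `False`.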
+### Notebook output hygiene
+A Jupyter `.ipynb` file is a JSON blob that embeds every cell's rendered output, which means a single `df.head()` on a customer table commits 5 real customer rows to git forever. Strip outputs with `nbstripout` as a pre-commit hook:
+```yaml
+# .pre-commit-config.yaml
+repos:
+  - repo: https://github.com/kynan/nbstripout
+    rev: 0.7.1
+    hooks:
+      - id: nbstripout
+        files: \.ipynb$
+```
+Install once with `pre-commit install` and every `git commit` scrubs outputs automatically. Pair with a Jupyter config (`jupyter_notebook_config.py`) that disables output saving entirely if you want belt-and-braces.
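As a complement to the hook, a CI job could confirm that no committed notebook still carries outputs. The sketch below relies only on the nbformat 4 JSON layout (a top-level `cells` list whose code cells have an `outputs` list); the function name `cells_with_outputs` is hypothetical.

```python
import json
from typing import List


def cells_with_outputs(notebook_json: str) -> List[int]:
    """Return indexes of code cells that still carry rendered outputs."""
    nb = json.loads(notebook_json)
    return [
        i
        for i, cell in enumerate(nb.get("cells", []))
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]
```

A non-empty result for any tracked `.ipynb` file would be a reason to fail the build.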
+Marimo's `.py`-format notebooks sidestep this problem — they are regular Python files, outputs never get persisted in the notebook, and diffs are reviewable like ordinary code. If you have not picked a notebook format for a new project, prefer Marimo; see `data-science-notebook-discipline` for the broader tradeoffs.
+Whichever format you pick, also keep prediction caches, CSV exports, and ad-hoc scratch files out of git — a broad `data/` and `outputs/` entry in `.gitignore` prevents the most common leak: a confidential sample dataset getting committed as an "example."
+### A word on model memorization
+Fine-tuned LLMs and RAG systems can reproduce training data verbatim under the right prompt. If your fine-tune corpus or retrieval index contains PII, assume it can leak. Mitigations, in order of strength: scrub PII from the corpus before training or indexing (reuse the masking helper above); host the model privately so prompts and responses stay inside your perimeter; apply output filtering to block regex-detectable identifiers on the way out. Do not fine-tune a public base model on raw customer data and then expose it on an open endpoint — that is the failure mode worth avoiding.
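The output-filtering mitigation can be sketched as a regex pass over generated text. The two patterns and the function name `filter_model_output` are illustrative assumptions; a real deployment would need broader, locale-aware rules and probably a dedicated scrubber.

```python
import re

# Illustrative patterns only: a simple email shape and the US SSN layout.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def filter_model_output(text: str) -> str:
    """Replace regex-detectable identifiers before a response leaves the service."""
    text = _EMAIL.sub("[EMAIL]", text)
    return _US_SSN.sub("[SSN]", text)
```

This is the weakest of the three mitigations, since it only catches identifiers with a predictable shape, which is why scrubbing the corpus comes first.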

package/content/knowledge/data-science/data-science-testing.md ADDED

@@ -0,0 +1,183 @@
+---
+name: data-science-testing
+description: Testing strategy for solo DS code — pytest for pure functions, pandera for DataFrame schemas at test time and at ingest boundaries, and committed CSV fixtures for deterministic tests
+topics: [data-science, testing, pytest, pandera]
+---
+Data-science code rots quietly. A notebook cell that worked on Tuesday's snapshot silently breaks on Friday's because an upstream column was renamed, a dtype shifted from `int64` to `float64`, or a categorical grew a new level nobody tested for. Refactors that move feature logic out of a notebook into `src/` routinely regress because there was no test pinning the old behavior. Tests catch these failures at the line that introduced them instead of at the end of a three-hour pipeline run.
+## Summary
+Treat DS testing as three separate layers with distinct tools. Use `pytest` for pure-function unit tests — feature engineering, metric calculations, preprocessing helpers in `src/`. Use `pandera` for DataFrame-level contracts: schemas assert column names, dtypes, value ranges, and non-null expectations, and those same schemas run both in tests and at runtime at ingest boundaries. Use committed CSV fixtures in `tests/fixtures/` loaded through pytest fixtures for deterministic, reviewable test data. Keep this doc's scope to CODE correctness; model quality (AUC, calibration, drift) belongs in `data-science-model-evaluation.md`.
+## Deep Guidance
+### Unit tests with pytest
+Every helper in `src/` that transforms data is a pure function candidate for `pytest`. Arrange small inputs, act by calling the function, assert on the output. If a helper reaches for a database or filesystem, push that I/O out to the caller so the core logic stays testable without mocks.
+```python
+# tests/test_features.py
+import numpy as np
+import pandas as pd
+import pytest
+from src.features import impute_missing_ages
+class TestImputeMissingAges:
+    def test_fills_nan_with_median(self):
+        df = pd.DataFrame({"age": [10.0, 20.0, 30.0, np.nan]})
+        result = impute_missing_ages(df)
+        assert result["age"].isna().sum() == 0
+        assert result.loc[3, "age"] == 20.0  # median of [10, 20, 30]
+    def test_preserves_non_null_values(self):
+        df = pd.DataFrame({"age": [10.0, 20.0, 30.0, np.nan]})
+        result = impute_missing_ages(df)
+        pd.testing.assert_series_equal(
+            result.loc[:2, "age"], df.loc[:2, "age"], check_names=False
+        )
+    def test_all_nan_raises(self):
+        df = pd.DataFrame({"age": [np.nan, np.nan]})
+        with pytest.raises(ValueError, match="all-null"):
+            impute_missing_ages(df)
+```
+Run with `pytest -q`. Add `--cov=src` via `pytest-cov` once the project has more than a handful of helpers; aim for coverage on feature-engineering and metrics modules, not notebooks.
+Four rules keep this layer productive:
+- **Name tests after the behavior, not the function**: `test_fills_nan_with_median` beats `test_impute_missing_ages_1`. The name is the failure message when CI turns red.
+- **One assertion family per test**: a test checks either output values, or output shape, or error behavior — not all three. Split into three tests. Failures point at the broken property immediately.
+- **Use `pd.testing.assert_frame_equal` and `np.testing.assert_allclose`**: never compare DataFrames with `==` or floats with exact equality. Pass `rtol`/`atol` explicitly so the tolerance is visible in the test.
+- **Mark slow tests**: decorate any test that loads a non-trivial dataset with `@pytest.mark.slow` and run the default suite with `-m "not slow"` so `pytest` stays under ~5 seconds on save.
+### Data-frame validation with pandera
+Column drift is the single most common source of silent DS bugs. `pandera` encodes a DataFrame contract once and reuses it as a test assertion and a runtime guard at ingest boundaries — the moment a CSV, parquet file, or API response becomes a DataFrame.
+```python
+# src/schemas.py
+import pandera.pandas as pa
+from pandera.typing import Series
+class CustomersSchema(pa.DataFrameModel):
+    customer_id: Series[int] = pa.Field(unique=True, ge=0)
+    age: Series[float] = pa.Field(ge=0, le=120, nullable=True)
+    signup_date: Series[pa.DateTime]
+    segment: Series[str] = pa.Field(isin=["free", "pro", "enterprise"])
+    class Config:
+        strict = True  # reject unexpected columns
+# src/ingest.py — runtime validation at the boundary
+import pandas as pd
+from src.schemas import CustomersSchema
+def load_customers(path: str) -> pd.DataFrame:
+    df = pd.read_csv(path, parse_dates=["signup_date"])
+    return CustomersSchema.validate(df)  # raises SchemaError on violation
+```
+The same schema doubles as a test fixture contract:
+```python
+# tests/test_ingest.py
+import pandas as pd
+import pytest
+from pandera.errors import SchemaError
+from src.ingest import load_customers
+def test_rejects_invalid_segment(tmp_path):
+    bad = tmp_path / "bad.csv"
+    # age is written as 30.0 so it parses as float64 and only `segment` violates the schema
+    bad.write_text("customer_id,age,signup_date,segment\n1,30.0,2024-01-01,vip\n")
+    with pytest.raises(SchemaError, match="segment"):
+        load_customers(str(bad))
+```
+Prefer `schema.validate(df)` calls over the `@pa.check_input` decorator — explicit validation is easier to trace in stack traces and does not hide behind import-time decoration.
+Three patterns make pandera pay off:
+- **Validate once at the boundary, trust downstream**: call `Schema.validate(df)` inside `load_customers`, `load_orders`, or whatever function first produces a DataFrame. Downstream code can then assume columns, dtypes, and ranges without re-checking.
+- **Use `lazy=True` during development**: `Schema.validate(df, lazy=True)` collects every violation instead of failing on the first, which is dramatically faster when fixing a bad CSV.
+- **Version schemas alongside migrations**: when a column renames or a new category lands, update the schema in the same PR as the code change. Schema drift caught in code review is cheaper than schema drift caught in production.
+### Fixtures: deterministic test data
+Random DataFrames in tests produce flaky failures that are painful to debug. Commit small, hand-curated CSVs to `tests/fixtures/` and load them through pytest `fixture` functions. The CSVs are reviewable in PRs, the fixtures are reusable across test modules.
+```python
+# tests/conftest.py
+from pathlib import Path
+import pandas as pd
+import pytest
+FIXTURES = Path(__file__).parent / "fixtures"
+@pytest.fixture
+def customers_df() -> pd.DataFrame:
+    return pd.read_csv(FIXTURES / "customers_small.csv", parse_dates=["signup_date"])
+@pytest.fixture(params=["customers_empty.csv", "customers_one_row.csv", "customers_small.csv"])
+def customers_edge_cases(request) -> pd.DataFrame:
+    return pd.read_csv(FIXTURES / request.param, parse_dates=["signup_date"])
+```
+Use `@pytest.mark.parametrize` to cover multiple scenarios without duplicating test bodies:
+```python
+@pytest.mark.parametrize(
+    "segment,expected_discount",
+    [("free", 0.0), ("pro", 0.1), ("enterprise", 0.2)],
+)
+def test_discount_by_segment(segment, expected_discount):
+    # pytest.approx avoids exact float equality, per the rule above
+    assert compute_discount(segment) == pytest.approx(expected_discount)
+```
+Keep fixture CSVs under ~50 rows. Anything larger belongs in a `data/` directory and should be generated or downloaded, not committed.
+When a test genuinely needs a larger or procedurally generated DataFrame, build it deterministically with a seeded RNG inside a fixture — never inline, and never with the global `np.random` state:
+```python
+import numpy as np
+import pandas as pd
+import pytest
+@pytest.fixture
+def synthetic_transactions() -> pd.DataFrame:
+    rng = np.random.default_rng(seed=42)
+    n = 1000
+    return pd.DataFrame({
+        "user_id": rng.integers(0, 100, size=n),
+        "amount": rng.lognormal(mean=3.0, sigma=1.0, size=n),
+        "ts": pd.date_range("2024-01-01", periods=n, freq="1h"),
+    })
+```
+A fixed seed means the same test always sees the same data, so flaky-failure postmortems are possible instead of "must have been a weird random sample."
+### Running the suite
+Layout and commands stay boring on purpose:
+```
+tests/
+  conftest.py          # shared fixtures
+  fixtures/            # small committed CSVs
+    customers_small.csv
+    customers_one_row.csv
+    customers_empty.csv
+  test_features.py     # pytest for src/features.py
+  test_ingest.py       # pandera + ingest tests
+  test_metrics.py      # pytest for src/metrics.py
+```
+Wire `pytest` into `pyproject.toml` so `pytest` alone runs the right suite:
+```toml
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+addopts = "-q --strict-markers -m 'not slow'"
+markers = ["slow: tests that load non-trivial data"]
+```
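+Run the fast suite on every save (a file-watcher like `pytest-watch` helps), and run the slow tests before each commit with `pytest -m slow`. Because `addopts` bakes in `-m 'not slow'`, a bare `pytest` still skips them; pass `-m "slow or not slow"` to run everything. In CI, run the full suite the same way.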
+### What NOT to test
+Don't unit-test `pandas`, `numpy`, or `pandera` themselves — assume upstream libraries work and pin versions in `pyproject.toml` to catch surprises via dependency bumps, not your own test suite. Don't assert on model quality metrics here (AUC, precision, calibration); those live in `data-science-model-evaluation.md` and run on held-out data, not fixtures. Don't write tests that require a live database, S3 bucket, or trained model file — those belong in integration tests run out-of-band, not the fast `pytest` suite a developer runs on save. And don't test notebooks directly; if a notebook cell has logic worth testing, extract it to `src/` first, then test the function.

package/content/knowledge/ml/README.md ADDED

@@ -0,0 +1,10 @@
+# `ml/` knowledge
+Production machine-learning domain knowledge injected into universal pipeline
+steps by `content/methodology/ml-overlay.yml`.
+## Lockstep pairs with `data-science/`
+Five documents here mirror documents in `content/knowledge/data-science/`. See
+`content/knowledge/data-science/README.md` for the full pair table. Edits to
+one side should trigger review of the other to prevent recommendation drift.

package/content/methodology/data-science-overlay.yml ADDED

@@ -0,0 +1,39 @@
+# methodology/data-science-overlay.yml
+name: data-science
+description: >
+  Data science overlay — injects solo / small-team data science domain
+  knowledge into existing pipeline steps for local-first, reproducibility-first
+  analytical work and model prototyping.
+project-type: data-science
+knowledge-overrides:
+  # Foundational
+  create-prd:            { append: [data-science-requirements] }
+  user-stories:          { append: [data-science-requirements] }
+  coding-standards:      { append: [data-science-conventions, data-science-notebook-discipline] }
+  project-structure:     { append: [data-science-project-structure] }
+  dev-env-setup:         { append: [data-science-dev-environment] }
+  git-workflow:          { append: [data-science-reproducibility] }
+  # Architecture & Design
+  system-architecture:   { append: [data-science-architecture] }
+  tech-stack:            { append: [data-science-architecture, data-science-dev-environment] }
+  adrs:                  { append: [data-science-architecture] }
+  domain-modeling:       { append: [data-science-data-versioning] }
+  database-schema:       { append: [data-science-data-versioning] }
+  security:              { append: [data-science-security] }
+  operations:            { append: [data-science-experiment-tracking, data-science-observability, data-science-reproducibility] }
+  # Testing
+  tdd:                   { append: [data-science-testing] }
+  create-evals:          { append: [data-science-testing, data-science-model-evaluation] }
+  # Reviews
+  review-architecture:   { append: [data-science-architecture] }
+  review-database:       { append: [data-science-data-versioning] }
+  review-security:       { append: [data-science-security] }
+  review-operations:     { append: [data-science-experiment-tracking, data-science-observability] }
+  review-testing:        { append: [data-science-testing, data-science-model-evaluation] }
+  # Planning
+  implementation-plan:   { append: [data-science-architecture] }

package/content/pipeline/build/multi-agent-resume.md CHANGED

@@ -177,14 +177,15 @@ Once in-progress work is complete (or if there was none):
 4. **Run code reviews (MANDATORY)**
    - Run the review-pr tool: `scaffold run review-pr` (CLI) or `/scaffold:review-pr` (plugin)
-   - This runs **all three** review channels on the PR diff:
+   - This runs the three MMR CLI channels on the PR diff plus the Superpowers code-reviewer agent as a complementary 4th channel reconciled through `mmr reconcile`:
      1. **Codex CLI**: `codex exec --skip-git-repo-check -s read-only --ephemeral "REVIEW_PROMPT" 2>/dev/null`
      2. **Gemini CLI**: `NO_BROWSER=true gemini -p "REVIEW_PROMPT" --output-format json --approval-mode yolo 2>/dev/null`
-     3. **Superpowers code-reviewer**: dispatch `superpowers:code-reviewer` subagent with BASE_SHA and HEAD_SHA
-   - Verify auth before each CLI (`codex login status`, `NO_BROWSER=true gemini -p "respond with ok" -o json`)
-   - All three channels must execute (skip only if a tool is genuinely not installed)
+     3. **Claude CLI**: `claude -p "REVIEW_PROMPT" --output-format json 2>/dev/null`
+     4. **Superpowers code-reviewer** (4th channel): dispatch `superpowers:code-reviewer` subagent with BASE_SHA and HEAD_SHA
+   - Verify auth before each CLI (`mmr config test` pre-flights all three at once)
+   - All four channels should execute. Missing Codex or Gemini → MMR runs a compensating Claude pass in its place (degraded-pass verdict). Missing Claude CLI → review proceeds without compensation.
    - Fix any P0/P1/P2 findings before proceeding
-   - Do NOT move to the next task until all channels have run
+   - Do NOT move to the next task until the review completes
 5. **Between-task cleanup**
    - `git fetch origin --prune && git clean -fd`
@@ -238,7 +239,7 @@ Once in-progress work is complete (or if there was none):
 5. **TDD is not optional** — Continue the red-green-refactor cycle for any in-progress work.
 6. **Quality gates before PR** — Never create a PR with failing checks.
 7. **Honor pre-push review when requested** — If the user or project workflow asks for pre-push multi-model review, run `scaffold run review-code` after quality gates and before `git push`.
-8. **Code review before next task** — After creating a PR, run all three review channels (Codex CLI, Gemini CLI, Superpowers code-reviewer) and fix all P0/P1/P2 findings before moving on.
+8. **Code review before next task** — After creating a PR, run `scaffold run review-pr`: three CLI channels (Codex CLI, Gemini CLI, Claude CLI) via MMR plus the Superpowers code-reviewer agent as a complementary 4th channel. Fix all P0/P1/P2 findings before moving on.
 9. **Follow CLAUDE.md** — It is the authority on project conventions and commands.
 ---

package/content/pipeline/build/multi-agent-start.md CHANGED

@@ -181,14 +181,15 @@ For each task:
 8. **Run code reviews (MANDATORY)**
    - Run the review-pr tool: `scaffold run review-pr` (CLI) or `/scaffold:review-pr` (plugin)
-   - This runs **all three** review channels on the PR diff:
+   - This runs the three MMR CLI channels on the PR diff plus the Superpowers code-reviewer agent as a complementary 4th channel reconciled through `mmr reconcile`:
      1. **Codex CLI**: `codex exec --skip-git-repo-check -s read-only --ephemeral "REVIEW_PROMPT" 2>/dev/null`
      2. **Gemini CLI**: `NO_BROWSER=true gemini -p "REVIEW_PROMPT" --output-format json --approval-mode yolo 2>/dev/null`
-     3. **Superpowers code-reviewer**: dispatch `superpowers:code-reviewer` subagent with BASE_SHA and HEAD_SHA
-   - Verify auth before each CLI (`codex login status`, `NO_BROWSER=true gemini -p "respond with ok" -o json`)
-   - All three channels must execute (skip only if a tool is genuinely not installed)
+     3. **Claude CLI**: `claude -p "REVIEW_PROMPT" --output-format json 2>/dev/null`
+     4. **Superpowers code-reviewer** (4th channel): dispatch `superpowers:code-reviewer` subagent with BASE_SHA and HEAD_SHA
+   - Verify auth before each CLI (`mmr config test` pre-flights all three at once)
+   - All four channels should execute. Missing Codex or Gemini → MMR runs a compensating Claude pass in its place (degraded-pass verdict). Missing Claude CLI → review proceeds without compensation.
    - Fix any P0/P1/P2 findings before proceeding
-   - Do NOT move to the next task until all channels have run
+   - Do NOT move to the next task until the review completes
 9. **Between-task cleanup**
    - `git fetch origin --prune && git clean -fd`
@@ -230,7 +231,7 @@ For each task:
 4. **TDD is not optional** — Write failing tests before implementation. No exceptions.
 5. **Quality gates before PR** — Never create a PR with failing checks.
 6. **Honor pre-push review when requested** — If the user or project workflow asks for pre-push multi-model review, run `scaffold run review-code` after quality gates and before `git push`.
-7. **Code review before next task** — After creating a PR, run all three review channels (Codex CLI, Gemini CLI, Superpowers code-reviewer) and fix all P0/P1/P2 findings before moving on.
+7. **Code review before next task** — After creating a PR, run `scaffold run review-pr`: three CLI channels (Codex CLI, Gemini CLI, Claude CLI) via MMR plus the Superpowers code-reviewer agent as a complementary 4th channel. Fix all P0/P1/P2 findings before moving on.
 8. **Avoid task conflicts** — Check what other agents are working on before claiming.
 9. **Follow CLAUDE.md** — It is the authority on project conventions and commands.
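
The gate "Fix any P0/P1/P2 findings before proceeding" amounts to filtering the reconciled findings by severity. The following is a minimal sketch of that check, not the tool's implementation: the finding shape (a dict with a `severity` key) and the names `blocking_findings` and `may_start_next_task` are hypothetical, since the actual `mmr reconcile` output format is not shown in this diff. Treating any other severity label as non-blocking is likewise an assumption.

```python
# Severities that the build prompts say must be fixed before moving on.
BLOCKING_SEVERITIES = {"P0", "P1", "P2"}


def blocking_findings(findings):
    """Return the findings whose severity is P0, P1, or P2.

    `findings` is assumed to be a list of dicts with a "severity" key;
    this shape is illustrative only.
    """
    return [f for f in findings if f.get("severity") in BLOCKING_SEVERITIES]


def may_start_next_task(findings):
    """True only when no blocking finding remains."""
    return not blocking_findings(findings)


if __name__ == "__main__":
    # Made-up sample data for illustration.
    sample = [
        {"severity": "P1", "title": "unchecked error path"},
        {"severity": "P3", "title": "naming nit"},
    ]
    print(blocking_findings(sample))
    print(may_start_next_task(sample))
```

In this sketch a task stays open while any P0, P1, or P2 entry is present, which matches the rule that the next task is not claimed until the review completes and its findings are fixed.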

package/content/pipeline/build/single-agent-resume.md CHANGED

@@ -154,14 +154,15 @@ Once in-progress work is complete (or if there was none):
 4. **Run code reviews (MANDATORY)**
    - Run the review-pr tool: `scaffold run review-pr` (CLI) or `/scaffold:review-pr` (plugin)
-   - This runs **all three** review channels on the PR diff:
+   - This runs the three MMR CLI channels on the PR diff plus the Superpowers code-reviewer agent as a complementary 4th channel reconciled through `mmr reconcile`:
      1. **Codex CLI**: `codex exec --skip-git-repo-check -s read-only --ephemeral "REVIEW_PROMPT" 2>/dev/null`
      2. **Gemini CLI**: `NO_BROWSER=true gemini -p "REVIEW_PROMPT" --output-format json --approval-mode yolo 2>/dev/null`
-     3. **Superpowers code-reviewer**: dispatch `superpowers:code-reviewer` subagent with BASE_SHA and HEAD_SHA
-   - Verify auth before each CLI (`codex login status`, `NO_BROWSER=true gemini -p "respond with ok" -o json`)
-   - All three channels must execute (skip only if a tool is genuinely not installed)
+     3. **Claude CLI**: `claude -p "REVIEW_PROMPT" --output-format json 2>/dev/null`
+     4. **Superpowers code-reviewer** (4th channel): dispatch `superpowers:code-reviewer` subagent with BASE_SHA and HEAD_SHA
+   - Verify auth before each CLI (`mmr config test` pre-flights all three at once)
+   - All four channels should execute. Missing Codex or Gemini → MMR runs a compensating Claude pass in its place (degraded-pass verdict). Missing Claude CLI → review proceeds without compensation.
    - Fix any P0/P1/P2 findings before proceeding
-   - Do NOT move to the next task until all channels have run
+   - Do NOT move to the next task until the review completes
 5. **Claim next task**
    - Return to main: `git checkout main && git pull origin main`
@@ -203,7 +204,7 @@ Once in-progress work is complete (or if there was none):
 4. **TDD is not optional** — Continue the red-green-refactor cycle for any in-progress work.
 5. **Quality gates before PR** — Never create a PR with failing checks.
 6. **Honor pre-push review when requested** — If the user or project workflow asks for pre-push multi-model review, run `scaffold run review-code` after quality gates and before `git push`.
-7. **Code review before next task** — After creating a PR, run all three review channels (Codex CLI, Gemini CLI, Superpowers code-reviewer) and fix all P0/P1/P2 findings before moving on.
+7. **Code review before next task** — After creating a PR, run `scaffold run review-pr`: three CLI channels (Codex CLI, Gemini CLI, Claude CLI) via MMR plus the Superpowers code-reviewer agent as a complementary 4th channel. Fix all P0/P1/P2 findings before moving on.
 8. **Follow CLAUDE.md** — It is the authority on project conventions and commands.
 ---

package/content/pipeline/build/single-agent-start.md CHANGED

@@ -160,14 +160,15 @@ For each task:
 8. **Run code reviews (MANDATORY)**
    - Run the review-pr tool: `scaffold run review-pr` (CLI) or `/scaffold:review-pr` (plugin)
-   - This runs **all three** review channels on the PR diff:
+   - This runs the three MMR CLI channels on the PR diff plus the Superpowers code-reviewer agent as a complementary 4th channel reconciled through `mmr reconcile`:
      1. **Codex CLI**: `codex exec --skip-git-repo-check -s read-only --ephemeral "REVIEW_PROMPT" 2>/dev/null`
      2. **Gemini CLI**: `NO_BROWSER=true gemini -p "REVIEW_PROMPT" --output-format json --approval-mode yolo 2>/dev/null`
-     3. **Superpowers code-reviewer**: dispatch `superpowers:code-reviewer` subagent with BASE_SHA and HEAD_SHA
-   - Verify auth before each CLI (`codex login status`, `NO_BROWSER=true gemini -p "respond with ok" -o json`)
-   - All three channels must execute (skip only if a tool is genuinely not installed)
+     3. **Claude CLI**: `claude -p "REVIEW_PROMPT" --output-format json 2>/dev/null`
+     4. **Superpowers code-reviewer** (4th channel): dispatch `superpowers:code-reviewer` subagent with BASE_SHA and HEAD_SHA
+   - Verify auth before each CLI (`mmr config test` pre-flights all three at once)
+   - All four channels should execute. Missing Codex or Gemini → MMR runs a compensating Claude pass in its place (degraded-pass verdict). Missing Claude CLI → review proceeds without compensation.
    - Fix any P0/P1/P2 findings before proceeding
-   - Do NOT move to the next task until all channels have run
+   - Do NOT move to the next task until the review completes
 9. **Update status**
    - If Beads: task status is managed via `bd` commands
@@ -201,7 +202,7 @@ For each task:
 2. **One task at a time** — Complete the current task fully before starting the next.
 3. **Quality gates before PR** — Never create a PR with failing checks.
 4. **Honor pre-push review when requested** — If the user or project workflow asks for pre-push multi-model review, run `scaffold run review-code` after quality gates and before `git push`.
-5. **Code review before next task** — After creating a PR, run all three review channels (Codex CLI, Gemini CLI, Superpowers code-reviewer) and fix all P0/P1/P2 findings before moving on.
+5. **Code review before next task** — After creating a PR, run `scaffold run review-pr`: three CLI channels (Codex CLI, Gemini CLI, Claude CLI) via MMR plus the Superpowers code-reviewer agent as a complementary 4th channel. Fix all P0/P1/P2 findings before moving on.
 6. **Update status immediately** — Mark tasks complete as soon as review passes.
 7. **Consult lessons.md** — Check for relevant anti-patterns before each task.
 8. **Follow CLAUDE.md** — It is the authority on project conventions and commands.
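
The fallback rule added in the review step (a missing Codex or Gemini is replaced by a compensating Claude pass with a degraded-pass verdict, while a missing Claude CLI is not compensated) can be modelled as a small planning function. This is an illustrative sketch only, not MMR's code: `plan_review`, `ReviewPlan`, and the channel labels are hypothetical names. Two behaviours here are assumptions the diff does not spell out: when Claude is missing alongside Codex or Gemini, the missing CLI is simply skipped, and the `degraded` flag is set only when a compensating pass actually ran.

```python
from __future__ import annotations

from dataclasses import dataclass, field

CLI_CHANNELS = ("codex", "gemini", "claude")  # the three MMR CLI channels
AGENT_CHANNEL = "superpowers:code-reviewer"   # complementary 4th channel


@dataclass
class ReviewPlan:
    passes: list[str] = field(default_factory=list)       # passes that will run
    compensated: list[str] = field(default_factory=list)  # missing CLIs covered by Claude
    skipped: list[str] = field(default_factory=list)      # missing CLIs with no stand-in

    @property
    def degraded(self) -> bool:
        # Assumption: "degraded-pass" applies only when compensation happened.
        return bool(self.compensated)


def plan_review(installed: set[str]) -> ReviewPlan:
    """Decide which review passes run, given the set of installed CLIs."""
    plan = ReviewPlan()
    for cli in CLI_CHANNELS:
        if cli in installed:
            plan.passes.append(cli)
        elif cli != "claude" and "claude" in installed:
            # Missing Codex or Gemini: Claude runs a compensating pass in its place.
            plan.passes.append(f"claude (compensating for {cli})")
            plan.compensated.append(cli)
        else:
            # Missing Claude, or nothing available to compensate with.
            plan.skipped.append(cli)
    # The Superpowers agent is dispatched as a subagent, not an external CLI.
    plan.passes.append(AGENT_CHANNEL)
    return plan


if __name__ == "__main__":
    print(plan_review({"codex", "claude"}))
```

With only Codex and Claude installed, the plan in this sketch contains a Codex pass, a compensating Claude pass for Gemini, the regular Claude pass, and the Superpowers agent, and it is flagged as degraded.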