npm - medsci-skills - Versions diffs - 4.1.0 - Mend

medsci-skills 4.1.0

Files changed (702)

package/skills/generate-codebook/SKILL.md ADDED

@@ -0,0 +1,155 @@
+---
+name: generate-codebook
+description: Generate a citable data dictionary / codebook from a tabular dataset (CSV/TSV/Excel/Parquet/Stata/SAS). Profiles every variable — role, type, units placeholder, level frequencies, range/quantiles, missingness — and emits codebook.md + codebook.json. Flags coded variables whose level meanings are unknown as [NEEDS DICTIONARY] rather than guessing them, feeding /define-variables and the dictionary-first workflow.
+triggers: generate codebook, data dictionary, codebook, profile variables, variable dictionary, describe dataset, what variables, column dictionary, build codebook
+tools: Read, Write, Edit, Bash, Grep, Glob
+model: inherit
+---
+# Generate Codebook Skill
+You help a medical researcher turn a raw tabular dataset into a structured,
+**citable** data dictionary (codebook). This is the *generator* side of the
+dictionary-first workflow: it produces the artifact that `/define-variables` and
+dictionary-first QC later consume. You generate code and review output — you do
+**not** invent the meaning of coded values.
+## Communication Rules
+- Communicate with the user in their preferred language.
+- Variable names, codebook fields, and report output are in English.
+- Medical terminology is always in English.
+## Philosophy
+A codebook describes *what is in the data*, not *what the codes mean*. Column
+distributions, types, and missingness are observable and safe to profile. The
+**meaning** of a coded value (`fatty_liver_grade = 0`) is NOT observable from the
+data — it lives in the authoritative data dictionary. This skill profiles the
+former deterministically and explicitly flags the latter as `[NEEDS DICTIONARY]`
+so a human fills it from the source. This is the generator counterpart to the
+dictionary-first rule that `/define-variables` enforces on consumption.
+## Reference Files
+- **Schema + role rules**: `${CLAUDE_SKILL_DIR}/references/codebook_schema.md` — the
+  codebook.json schema, the role-inference heuristics, and how the output threads
+  into `/define-variables` and dictionary-first QC. Read this before interpreting output.
+## Deterministic Script
+Run the bundled profiler rather than describing columns from memory:
+```bash
+python "${CLAUDE_SKILL_DIR}/scripts/generate_codebook.py" data.csv --out-dir .
+```
+Supports `.csv/.tsv/.xlsx/.parquet/.dta/.sas7bdat`. Flags: `--max-levels N`
+(categorical cutoff, default 20), `--json-only`, `--md-only`. The script depends
+only on pandas (Excel and Parquet input additionally need an installed engine
+such as openpyxl or pyarrow), runs locally, and never sends data anywhere.
+## Workflow
+### Step 1: Profile (deterministic)
+Run `generate_codebook.py` on the dataset. It writes `codebook.json` (machine-
+readable) and `codebook.md` (review table), reporting per variable: role
+(id / continuous / categorical / binary / date / text), dtype, missingness,
+unique count, level frequencies or quantile summary, and a `needs_dictionary` flag.
+### Step 2: Review with the researcher (gate)
+Present `codebook.md` and walk the user through it. **Gate:** the user confirms
+the inferred roles (e.g., an integer-coded scale mis-read as continuous, or an id
+column). Do not proceed to definition work until the user approves the role
+assignments.
+### Step 3: Resolve [NEEDS DICTIONARY] items (gate)
+For every variable flagged `needs_dictionary: true`, the level codes are
+uninterpretable without the authoritative source. **Gate:** ask the user to
+supply the meaning of each code from the real data dictionary (file/sheet/row),
+or to confirm none exists. Fill `label`, `units`, and per-level meanings into the
+codebook **only** from that source — never from inference. If the user cannot
+supply it, leave the `[NEEDS DICTIONARY]` marker in place; do not erase it.
+### Step 4: Hand off
+The completed `codebook.json` becomes the input dictionary for `/define-variables`
+(operationalization) and the citation source for dictionary-first QC. **Gate:**
+confirm with the user that no `needs_dictionary` flags remain unresolved before
+the codebook is treated as authoritative for downstream analysis.
+## Scope Limitations
+### Supported
+- Tabular files: CSV, TSV, Excel, Parquet, Stata (`.dta`), SAS (`.sas7bdat`).
+- Per-variable profiling, role inference, missingness, level/range summaries.
+### NOT Supported
+- Inventing or guessing the meaning of coded values (that is `[NEEDS DICTIONARY]`).
+- Cleaning or transforming data — use `/clean-data`.
+- De-identification — use `/deidentify` before sharing.
+- Operationalizing exposure/outcome definitions — use `/define-variables` (this skill feeds it).
+## Cross-Skill Integration
+- **/define-variables** consumes `codebook.json` as its data dictionary input.
+- **/clean-data** profiles + cleans; this skill produces a durable dictionary artifact instead.
+- **/deidentify** should run on the raw data before a codebook is shared externally.
+## Output Format
+`codebook.json` (schema in references) and `codebook.md` (review table with a
+"Columns requiring dictionary lookup" section). Summarize the counts
+(rows, columns, `needs_dictionary_count`) in chat; do not paste the full JSON.
+## Worked Example
+Input `cohort.csv`:
+```text
+patient_id,age,sex,fatty_liver_grade,smoking_status,visit_date
+1001,54,1,0,never,2023-01-15
+1002,61,2,2,former,2023-02-03
+```
+Run:
+```bash
+python "${CLAUDE_SKILL_DIR}/scripts/generate_codebook.py" cohort.csv --out-dir .
+# -> {"n_rows": ..., "n_columns": 6, "needs_dictionary_count": 2, "outputs": [...]}
+```
+`codebook.md` (excerpt):
+```text
+| Variable            | Role        | Missing % | Unique | Needs dictionary |
+| `patient_id`        | id          | 0.0       | N      |                  |
+| `age`               | continuous  | 0.0       | ...    |                  |
+| `sex`               | binary      | 0.0       | 2      | ⚠️ YES           |
+| `fatty_liver_grade` | categorical | 0.0       | 5      | ⚠️ YES           |
+| `smoking_status`    | categorical | 0.0       | 3      |                  |
+| `visit_date`        | date        | 0.0       | ...    |                  |
+```
+`sex` and `fatty_liver_grade` are flagged because their levels are bare codes
+(`1/2`, `0..4`). `smoking_status` is **not** flagged — its levels are already
+human-readable. The reviewer then:
+1. Opens the project's authoritative data dictionary.
+2. Fills `sex`: `1 = male, 2 = female` and `fatty_liver_grade`: `0 = none … 4 = suspected`
+   into the codebook **from that source** (citing file > sheet > row).
+3. Confirms no `[NEEDS DICTIONARY]` flags remain, then hands `codebook.json` to
+   `/define-variables`.
+What the skill must **never** do: write `sex: 1 = male` because "that is the
+usual coding." If the dictionary is unavailable, the flag stays.
+## Anti-Hallucination
+- Never invent a variable's label, units, or the meaning of any coded level.
+- Coded categorical/binary columns with bare codes are flagged `[NEEDS DICTIONARY]`;
+  the meaning is filled only from the authoritative data dictionary, then cited.
+- Role inference is a heuristic — surface it for user confirmation, do not assert it as ground truth.
+- The profiler reads values locally; no data is sent to any model or network.
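
The Step 4 gate in the SKILL.md above ("no `needs_dictionary` flags remain unresolved") could be checked mechanically along these lines. This is a minimal sketch, not part of the package: `unresolved_columns` and the `sample` fragment are hypothetical, and it assumes a reviewer sets `needs_dictionary` to `false` once the meanings have been filled from the authoritative dictionary, which the diff does not specify.

```python
def unresolved_columns(codebook):
    """Return the names of columns whose needs_dictionary flag is still true."""
    return [
        col["name"]
        for col in codebook.get("columns", [])
        if col.get("needs_dictionary")
    ]


# Hypothetical codebook fragment using only fields from the documented schema.
sample = {
    "columns": [
        {"name": "sex", "role": "binary", "needs_dictionary": True},
        {"name": "smoking_status", "role": "categorical", "needs_dictionary": False},
    ]
}

pending = unresolved_columns(sample)
if pending:
    print("Still needs the authoritative dictionary:", ", ".join(pending))
else:
    print("No unresolved needs_dictionary flags.")
```

A non-empty result would mean the codebook should not yet be handed to `/define-variables`.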

package/skills/generate-codebook/references/codebook_schema.md ADDED

@@ -0,0 +1,76 @@
+# Codebook Schema & Role Inference
+`generate_codebook.py` emits `codebook.json` (machine-readable) and `codebook.md`
+(human review). This file documents the JSON schema, the role-inference rules,
+and how the output threads into downstream skills.
+## codebook.json schema (schema_version 1)
+```jsonc
+{
+  "schema_version": 1,
+  "source": "path/to/data.csv",
+  "n_rows": 200,
+  "n_columns": 9,
+  "needs_dictionary_count": 2,
+  "columns": [
+    {
+      "name": "fatty_liver_grade",
+      "role": "categorical",          // id | continuous | categorical | binary | date | text
+      "dtype": "int64",
+      "n": 200,
+      "n_missing": 0,
+      "pct_missing": 0.0,
+      "n_unique": 5,
+      "label": null,                   // filled by researcher from the authoritative dictionary
+      "units": null,                   // filled by researcher
+      "needs_dictionary": true,        // true => level meanings are unknown; do NOT guess
+      "notes": ["[NEEDS DICTIONARY] level codes are uninterpretable ..."],
+      "levels": [{"value": 0, "count": 41}, {"value": 1, "count": 39}],  // categorical/binary
+      "stats": {"min": 0, "q1": 1, "median": 2, "q3": 3, "max": 4},      // continuous/date
+      "examples": ["0", "2", "1"]
+    }
+  ]
+}
+```
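+`label`, `units`, and per-level meanings are intentionally left `null` / unlabelled.
+They are filled **only** from the authoritative data dictionary, never inferred.
+The example shows `levels` and `stats` side by side for reference only: the
+profiler writes `levels` (at most the 15 most frequent values) for
+categorical/binary columns and `stats` for continuous/date columns. Continuous
+columns also carry `mean` and `sd`; date columns carry only `min` and `max`.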
+## Role inference (heuristic — confirm with the user)
+Decided in this order; dtype and column name dominate so that continuous
+measurements are never misread as identifiers on small datasets:
+1. **date** — datetime dtype, or >80% of a sample parses as dates.
+2. **binary** — exactly 2 distinct non-null values.
+3. **numeric dtype:**
+   - **id** — integer-valued, id-like name (`*_id`, `uid`, `mrn`, `subject`, `patient`, `record`, `accession`), and unique count > `--max-levels`.
+   - **categorical** — integer-valued and unique count ≤ `--max-levels` (coded scale).
+   - **continuous** — otherwise (floats, or many distinct integers).
+4. **object/string:**
+   - **id** — id-like name with high cardinality, or all-unique on ≥50 rows.
+   - **categorical** — unique count ≤ `--max-levels`.
+   - **text** — otherwise.
+`--max-levels` (default 20) is the categorical cutoff. Raise it for scales with
+many levels; lower it to force more columns to `continuous`/`text`.
+## needs_dictionary flag
+A categorical/binary column is flagged `needs_dictionary: true` when its levels
+are **bare codes** — integers, or short tokens like `Y`/`N`/`M`/`U`/`NA` — i.e.,
+uninterpretable without the source dictionary. A column whose levels are already
+human-readable (`never` / `former` / `current`) is **not** flagged. The flag is a
+prompt for the researcher to fill meanings from the authoritative dictionary; it
+is never resolved by guessing.
+## Downstream integration
+- **/define-variables** takes `codebook.json` as its data-dictionary input and
+  operationalizes exposure/outcome/covariate definitions on top of it. Unresolved
+  `needs_dictionary` flags should be cleared first.
+- **dictionary-first QC** cites the codebook (file > variable) as the provenance
+  for a coded value's meaning. The codebook is the artifact that citation points at.
+- **/clean-data** and **/deidentify** operate on the raw data; run `/deidentify`
+  before a codebook (which contains example values) is shared externally.
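
The bare-code rule described under "needs_dictionary flag" can be illustrated without pandas. The sketch below is a simplified pure-Python restatement, not the packaged function: `looks_like_bare_codes` is a hypothetical name, and plain `int`/`float` checks stand in for the profiler's numeric test. It follows the length-based rule in the bundled script (tokens of three characters or fewer count as codes), so short readable words such as `yes`/`no` would also be flagged under that rule.

```python
def looks_like_bare_codes(levels):
    """True when every level is a number or a short token (<= 3 characters)."""
    levels = [v for v in levels if v is not None]
    if not levels:
        return False
    for value in levels:
        if isinstance(value, (int, float)):
            continue  # numeric codes always count as bare codes
        token = str(value).strip()
        if len(token) <= 3 or token.isdigit():
            continue  # short tokens and digit strings look like codes
        return False  # a single readable label clears the flag for the column
    return True


print(looks_like_bare_codes([1, 2]))                          # True
print(looks_like_bare_codes(["never", "former", "current"]))  # False
print(looks_like_bare_codes(["yes", "no"]))                   # True
print(looks_like_bare_codes(["Y", "N", "unknown"]))           # False
```

Because one longer label is enough to clear the flag, a column mixing codes and words (`Y`, `N`, `unknown`) is not flagged, which is worth keeping in mind during the Step 2 review.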

package/skills/generate-codebook/scripts/generate_codebook.py ADDED

@@ -0,0 +1,278 @@
+#!/usr/bin/env python3
+"""Generate a citable data dictionary / codebook from a tabular dataset.
+Profiles every column of a dataset and emits two artifacts:
+  - codebook.json  — machine-readable (consumed by /define-variables and
+                      dictionary-first QC)
+  - codebook.md    — human-readable table for review/sharing
+Hard anti-hallucination rule: the meaning of coded values is NEVER invented. A
+categorical/binary column whose levels are bare codes (0/1/2, "Y"/"N", ...) is
+flagged `needs_dictionary: true` with a `[NEEDS DICTIONARY]` note, so the
+researcher fills the meaning from the authoritative data dictionary rather than
+the model guessing. This is the generator side of the dictionary-first rule.
+Profiling is deterministic and local — pandas only, no network, no LLM touches
+the values. Optional engines (openpyxl/pyarrow/Stata) are used only if present.
+"""
+from __future__ import annotations
+import argparse
+import json
+import sys
+from pathlib import Path
+try:
+    import pandas as pd
+except ImportError:
+    print("ERROR: pandas is required (pip install pandas).", file=sys.stderr)
+    sys.exit(2)
+CATEGORICAL_MAX_LEVELS_DEFAULT = 20
+TOP_LEVELS = 15
+EXAMPLES = 3
+def read_table(path: Path) -> "pd.DataFrame":
+    suf = path.suffix.lower()
+    if suf in (".csv", ".txt"):
+        return pd.read_csv(path)
+    if suf in (".tsv",):
+        return pd.read_csv(path, sep="\t")
+    if suf in (".xlsx", ".xls"):
+        return pd.read_excel(path)            # needs openpyxl/xlrd
+    if suf in (".parquet", ".pq"):
+        return pd.read_parquet(path)          # needs pyarrow/fastparquet
+    if suf in (".dta",):
+        return pd.read_stata(path)
+    if suf in (".sas7bdat",):
+        return pd.read_sas(path)
+    # default: try CSV
+    return pd.read_csv(path)
+def _looks_like_date(series: "pd.Series") -> bool:
+    if pd.api.types.is_datetime64_any_dtype(series):
+        return True
+    s = series.dropna().astype(str).head(50)
+    if s.empty:
+        return False
+    parsed = pd.to_datetime(s, errors="coerce", format="mixed") if hasattr(pd, "__version__") else pd.to_datetime(s, errors="coerce")
+    return parsed.notna().mean() > 0.8
+MIN_ROWS_FOR_UNIQUENESS_ID = 50  # below this, "all-unique" is not a reliable id signal
+def _is_integer_valued(series: "pd.Series") -> bool:
+    if pd.api.types.is_integer_dtype(series):
+        return True
+    nn = series.dropna()
+    if nn.empty or not pd.api.types.is_numeric_dtype(series):
+        return False
+    try:
+        return bool((nn.astype(float) % 1 == 0).all())
+    except Exception:
+        return False
+def infer_role(series: "pd.Series", n_rows: int, n_unique: int, max_levels: int, name: str) -> str:
+    """Dtype- and name-driven role inference.
+    Order matters: date and binary are decided before id, and id is conservative
+    (a float column or a small dataset's all-unique column is NOT an id — that
+    misclassifies continuous measurements as identifiers on small data).
+    """
+    name_l = name.lower()
+    id_name = name_l == "id" or any(
+        tok in name_l for tok in ("_id", "id_", "uid", "mrn", "subject", "patient", "record", "accession")
+    )
+    if _looks_like_date(series):
+        return "date"
+    if n_unique == 2:
+        return "binary"
+    if pd.api.types.is_numeric_dtype(series):
+        intlike = _is_integer_valued(series)
+        # id only for integer-valued, high-cardinality, id-named columns
+        if id_name and intlike and n_unique > max_levels:
+            return "id"
+        if intlike and n_unique <= max_levels:
+            return "categorical"
+        return "continuous"
+    # object / string
+    if id_name and n_unique > max_levels:
+        return "id"
+    if n_unique == n_rows and n_rows >= MIN_ROWS_FOR_UNIQUENESS_ID:
+        return "id"
+    if n_unique <= max_levels:
+        return "categorical"
+    return "text"
+def _coded_levels(series: "pd.Series") -> bool:
+    """True when the categorical/binary levels are bare codes needing a dictionary."""
+    vals = series.dropna().unique().tolist()
+    if not vals:
+        return False
+    for v in vals:
+        if pd.api.types.is_number(v):
+            continue
+        s = str(v).strip()
+        # short tokens / pure codes look uninterpretable without a dictionary
+        if len(s) <= 3 or s.isdigit() or s.upper() in ("Y", "N", "T", "F", "M", "U", "NA"):
+            continue
+        return False  # at least one human-readable label present
+    return True
+def profile_column(df: "pd.DataFrame", col: str, n_rows: int, max_levels: int) -> dict:
+    s = df[col]
+    n_missing = int(s.isna().sum())
+    nonnull = s.dropna()
+    n_unique = int(nonnull.nunique())
+    role = infer_role(s, n_rows, n_unique, max_levels, col)
+    rec: dict = {
+        "name": col,
+        "role": role,
+        "dtype": str(s.dtype),
+        "n": int(n_rows),
+        "n_missing": n_missing,
+        "pct_missing": round(100.0 * n_missing / n_rows, 2) if n_rows else 0.0,
+        "n_unique": n_unique,
+        "label": None,                 # to be filled by researcher from the real dictionary
+        "units": None,                 # to be filled by researcher
+        "needs_dictionary": False,
+        "notes": [],
+    }
+    if role == "continuous":
+        desc = nonnull.astype(float)
+        if not desc.empty:
+            rec["stats"] = {
+                "min": float(desc.min()), "q1": float(desc.quantile(0.25)),
+                "median": float(desc.median()), "q3": float(desc.quantile(0.75)),
+                "max": float(desc.max()), "mean": round(float(desc.mean()), 4),
+                "sd": round(float(desc.std()), 4) if len(desc) > 1 else 0.0,
+            }
+        rec["notes"].append("[NEEDS DICTIONARY] confirm units and measurement method")
+    elif role in ("categorical", "binary"):
+        vc = nonnull.value_counts().head(TOP_LEVELS)
+        rec["levels"] = [{"value": (int(v) if pd.api.types.is_number(v) and float(v).is_integer() else
+                                    (float(v) if pd.api.types.is_number(v) else str(v))),
+                          "count": int(c)} for v, c in vc.items()]
+        if _coded_levels(nonnull):
+            rec["needs_dictionary"] = True
+            rec["notes"].append("[NEEDS DICTIONARY] level codes are uninterpretable without the authoritative data dictionary — do not guess meanings")
+    elif role == "date":
+        try:
+            d = pd.to_datetime(nonnull, errors="coerce")
+            rec["stats"] = {"min": str(d.min()), "max": str(d.max())}
+        except Exception:
+            pass
+        rec["notes"].append("[NEEDS DICTIONARY] confirm whether this is event / measurement / enrollment date")
+    elif role == "id":
+        rec["notes"].append("identifier candidate (high/maximal cardinality) — exclude from analysis variables")
+    examples = nonnull.head(EXAMPLES).tolist()
+    rec["examples"] = [str(x) for x in examples]
+    return rec
+def build_codebook(df: "pd.DataFrame", source: str, max_levels: int) -> dict:
+    n_rows = len(df)
+    cols = [profile_column(df, c, n_rows, max_levels) for c in df.columns]
+    return {
+        "schema_version": 1,
+        "source": source,
+        "n_rows": n_rows,
+        "n_columns": len(df.columns),
+        "needs_dictionary_count": sum(1 for c in cols if c["needs_dictionary"]),
+        "columns": cols,
+    }
+def render_md(cb: dict) -> str:
+    lines = [
+        f"# Codebook — {Path(cb['source']).name}",
+        "",
+        f"- Rows: {cb['n_rows']}",
+        f"- Columns: {cb['n_columns']}",
+        f"- Columns needing a data dictionary: **{cb['needs_dictionary_count']}**",
+        "",
+        "> `[NEEDS DICTIONARY]` rows require the meaning to be filled from the "
+        "authoritative data dictionary. Meanings were **not** guessed.",
+        "",
+        "| Variable | Role | Dtype | Missing % | Unique | Summary | Needs dictionary |",
+        "|---|---|---|---|---|---|---|",
+    ]
+    for c in cb["columns"]:
+        if c["role"] == "continuous" and "stats" in c:
+            s = c["stats"]
+            summ = f"median {s['median']} [{s['q1']}–{s['q3']}], range {s['min']}–{s['max']}"
+        elif c["role"] in ("categorical", "binary") and c.get("levels"):
+            summ = ", ".join(f"{l['value']}={l['count']}" for l in c["levels"][:6])
+            if len(c["levels"]) > 6:
+                summ += ", …"
+        elif c["role"] == "date" and "stats" in c:
+            summ = f"{c['stats'].get('min','?')} → {c['stats'].get('max','?')}"
+        else:
+            summ = ", ".join(c.get("examples", [])[:3])
+        nd = "⚠️ YES" if c["needs_dictionary"] else ""
+        summ = summ.replace("|", "\\|")
+        lines.append(f"| `{c['name']}` | {c['role']} | {c['dtype']} | {c['pct_missing']} | {c['n_unique']} | {summ} | {nd} |")
+    nd_cols = [c["name"] for c in cb["columns"] if c["needs_dictionary"]]
+    if nd_cols:
+        lines += ["", "## Columns requiring dictionary lookup", ""]
+        for name in nd_cols:
+            lines.append(f"- `{name}` — fill level meanings + units from the authoritative data dictionary, then cite per the dictionary-first rule before use in /define-variables.")
+    lines.append("")
+    return "\n".join(lines)
+def main() -> int:
+    ap = argparse.ArgumentParser(description="Generate a data dictionary / codebook from a tabular dataset.")
+    ap.add_argument("data", help="Path to .csv/.tsv/.xlsx/.parquet/.dta/.sas7bdat")
+    ap.add_argument("--out-dir", default=".", help="Output directory (default: cwd)")
+    ap.add_argument("--max-levels", type=int, default=CATEGORICAL_MAX_LEVELS_DEFAULT,
+                    help=f"Max distinct values to treat a column as categorical (default {CATEGORICAL_MAX_LEVELS_DEFAULT})")
+    ap.add_argument("--json-only", action="store_true", help="Write only codebook.json")
+    ap.add_argument("--md-only", action="store_true", help="Write only codebook.md")
+    args = ap.parse_args()
+    data_path = Path(args.data)
+    if not data_path.exists():
+        print(f"ERROR: data file not found: {data_path}", file=sys.stderr)
+        return 2
+    try:
+        df = read_table(data_path)
+    except Exception as e:
+        print(f"ERROR: could not read {data_path}: {e}", file=sys.stderr)
+        return 2
+    cb = build_codebook(df, str(data_path), args.max_levels)
+    out = Path(args.out_dir)
+    out.mkdir(parents=True, exist_ok=True)
+    if not args.md_only:
+        (out / "codebook.json").write_text(json.dumps(cb, indent=2, ensure_ascii=False), encoding="utf-8")
+    if not args.json_only:
+        (out / "codebook.md").write_text(render_md(cb), encoding="utf-8")
+    print(json.dumps({
+        "n_rows": cb["n_rows"],
+        "n_columns": cb["n_columns"],
+        "needs_dictionary_count": cb["needs_dictionary_count"],
+        "outputs": [str(out / "codebook.json")] * (not args.md_only) + [str(out / "codebook.md")] * (not args.json_only),
+    }, indent=2))
+    return 0
+if __name__ == "__main__":
+    sys.exit(main())
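
In `main()` above, `--json-only` and `--md-only` are independent switches rather than a mutually exclusive pair, so the set of files written depends on both. The sketch below mirrors that write logic in isolation; `planned_outputs` is a hypothetical helper introduced here for illustration and does not exist in the package.

```python
from pathlib import Path


def planned_outputs(out_dir, json_only=False, md_only=False):
    """Mirror the write conditions in main(): each flag suppresses the other file."""
    out = Path(out_dir)
    paths = []
    if not md_only:
        paths.append(out / "codebook.json")
    if not json_only:
        paths.append(out / "codebook.md")
    return paths


print([p.name for p in planned_outputs(".")])                  # ['codebook.json', 'codebook.md']
print([p.name for p in planned_outputs(".", json_only=True)])  # ['codebook.json']
print([p.name for p in planned_outputs(".", md_only=True)])    # ['codebook.md']
print(planned_outputs(".", json_only=True, md_only=True))      # []
```

Passing both flags therefore yields no files and an empty `outputs` list while the script still exits with status 0.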

package/skills/generate-codebook/skill.yml ADDED

@@ -0,0 +1,35 @@
+schema_version: 2
+name: generate-codebook
+layer: A
+owner_domain: data_documentation
+when_to_use: "Generate a data dictionary and codebook from a dataset."
+when_NOT_to_use: "Cleaning the data (use clean-data); versioning the dataset (use version-dataset)."
+inputs:
+  - "analysis dataset (CSV/Excel)"
+outputs:
+  - "data dictionary"
+  - "codebook"
+deterministic_scripts:
+  - scripts/generate_codebook.py
+side_effects:
+  - writes_codebook_artifacts
+downstream_consumers:
+  - define-variables
+  - version-dataset
+forbidden_actions:
+  - fabricate_variable_descriptions
+  - infer_units_not_present_in_data
+# v2.1 quality card
+purpose: "Derive a structured codebook (variables, types, ranges, missingness) directly from a dataset."
+safety_boundaries:
+  - "Descriptive statistics are computed from the data by the bundled script, not asserted."
+  - "Variable semantics not present in the data are left blank for the researcher, not invented."
+known_limitations:
+  - "Computes structure and distributions; does not supply clinical meaning of variables."
+  - "Free-text/semantic descriptions require researcher input."
+validation_commands:
+  - "python3 scripts/generate_codebook.py <dataset>"
+evidence_surface: bundled_script

package/skills/generate-codebook/tests/test_generate_codebook.sh ADDED

@@ -0,0 +1,76 @@
+#!/usr/bin/env bash
+# Regression tests for generate-codebook/scripts/generate_codebook.py.
+# Self-contained: builds a synthetic dataset (no committed data) and asserts
+# role inference, the needs_dictionary flag, and the no-hallucination invariant.
+set -uo pipefail
+REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"
+SCRIPT="$REPO_ROOT/skills/generate-codebook/scripts/generate_codebook.py"
+TMP="$(mktemp -d -t codebook.XXXXXX)"
+trap 'rm -rf "$TMP"' EXIT
+[[ -f "$SCRIPT" ]] || { echo "ENV-ERR: script missing" >&2; exit 2; }
+python3 -c "import pandas, numpy" 2>/dev/null || { echo "SKIP: pandas/numpy not installed"; exit 0; }
+# Build a realistic synthetic dataset (seeded, no real data).
+python3 - "$TMP" <<'PY'
+import sys, numpy as np, pandas as pd
+out = sys.argv[1]
+rng = np.random.default_rng(42); n = 200
+pd.DataFrame({
+    "patient_id": np.arange(10001, 10001+n),
+    "age": rng.integers(30, 85, n),
+    "sex": rng.integers(1, 3, n),                       # coded -> needs_dictionary
+    "fatty_liver_grade": rng.integers(0, 5, n),         # coded -> needs_dictionary
+    "bmi": rng.normal(25, 3, n).round(1),               # continuous
+    "visit_date": pd.to_datetime("2023-01-01") + pd.to_timedelta(rng.integers(0,365,n), unit="D"),
+    "smoking_status": rng.choice(["never","former","current"], n),  # labelled -> NOT needs_dictionary
+}).to_csv(f"{out}/data.csv", index=False)
+PY
+python3 "$SCRIPT" "$TMP/data.csv" --out-dir "$TMP/out" >/dev/null 2>&1
+fail=0; ran=0
+assert() {
+    local label="$1" cond="$2"
+    ran=$((ran+1))
+    if [[ "$cond" == "1" ]]; then printf '  PASS  %s\n' "$label"
+    else printf '  FAIL  %s\n' "$label"; fail=$((fail+1)); fi
+}
+# Outputs exist.
+assert "codebook.json written" "$([[ -f "$TMP/out/codebook.json" ]] && echo 1 || echo 0)"
+assert "codebook.md written"   "$([[ -f "$TMP/out/codebook.md" ]] && echo 1 || echo 0)"
+# Role inference + needs_dictionary + no-hallucination, asserted from JSON.
+while IFS=$'\t' read -r status label; do
+    [[ -z "$label" ]] && continue
+    assert "$label" "$([[ "$status" == "PASS" ]] && echo 1 || echo 0)"
+done < <(python3 - "$TMP/out/codebook.json" <<'PY'
+import json, sys
+cb = json.load(open(sys.argv[1]))
+col = {c["name"]: c for c in cb["columns"]}
+checks = {
+    "role: patient_id=id": col["patient_id"]["role"] == "id",
+    "role: age=continuous": col["age"]["role"] == "continuous",
+    "role: bmi=continuous": col["bmi"]["role"] == "continuous",
+    "role: sex=binary": col["sex"]["role"] == "binary",
+    "role: fatty_liver_grade=categorical": col["fatty_liver_grade"]["role"] == "categorical",
+    "role: visit_date=date": col["visit_date"]["role"] == "date",
+    "role: smoking_status=categorical": col["smoking_status"]["role"] == "categorical",
+    "needs_dict: sex flagged": col["sex"]["needs_dictionary"] is True,
+    "needs_dict: fatty_liver_grade flagged": col["fatty_liver_grade"]["needs_dictionary"] is True,
+    "needs_dict: smoking_status NOT flagged": col["smoking_status"]["needs_dictionary"] is False,
+    "needs_dict: bmi NOT flagged": col["bmi"]["needs_dictionary"] is False,
+    "no-hallucination: labels null": all(c["label"] is None for c in cb["columns"]),
+    "no-hallucination: units null": all(c["units"] is None for c in cb["columns"]),
+    "count: needs_dictionary_count==2": cb["needs_dictionary_count"] == 2,
+}
+for k, v in checks.items():
+    print(("PASS" if v else "FAIL") + "\t" + k)
+PY
+)
+printf '\n%d/%d checks passed\n' "$((ran-fail))" "$ran"
+[[ "$fail" -eq 0 ]] || exit 1
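
The invariants asserted by the test script (consistent counts, known roles, `label` and `units` left null) can also be applied to any freshly generated `codebook.json`. The following is a minimal sketch under stated assumptions: `check_fresh_codebook` and the `example` dict are hypothetical, and the null-label check only makes sense before a researcher has filled anything in.

```python
ROLES = {"id", "continuous", "categorical", "binary", "date", "text"}


def check_fresh_codebook(codebook):
    """Return a list of problems found in a freshly generated codebook dict."""
    problems = []
    columns = codebook.get("columns", [])
    if codebook.get("n_columns") != len(columns):
        problems.append("n_columns does not match the number of column records")
    flagged = sum(1 for c in columns if c.get("needs_dictionary") is True)
    if codebook.get("needs_dictionary_count") != flagged:
        problems.append("needs_dictionary_count does not match the flagged columns")
    for c in columns:
        name = c.get("name", "<unnamed>")
        if c.get("role") not in ROLES:
            problems.append(f"{name}: unknown role {c.get('role')!r}")
        if c.get("label") is not None or c.get("units") is not None:
            problems.append(f"{name}: label/units filled before review")
    return problems


# Hypothetical two-column codebook shaped like the documented schema.
example = {
    "n_columns": 2,
    "needs_dictionary_count": 1,
    "columns": [
        {"name": "sex", "role": "binary", "label": None, "units": None,
         "needs_dictionary": True},
        {"name": "bmi", "role": "continuous", "label": None, "units": None,
         "needs_dictionary": False},
    ],
}

print(check_fresh_codebook(example))  # []
```

An empty list means the structural invariants hold; it says nothing about whether the inferred roles are correct, which remains a review step for the researcher.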