medsci-skills 4.7.0 → 4.8.0

Files changed (44)

package/README.md CHANGED

@@ -11,6 +11,7 @@
 [![CI](https://img.shields.io/github/actions/workflow/status/Aperivue/medsci-skills/validate.yml?branch=main&style=flat-square&label=CI)](https://github.com/Aperivue/medsci-skills/actions/workflows/validate.yml)
 ![Skills](https://img.shields.io/badge/Skills-45-brightgreen?style=flat-square)
 [![npm](https://img.shields.io/npm/v/medsci-skills?style=flat-square&label=npm&color=cb3837)](https://www.npmjs.com/package/medsci-skills)
+[![Watch the 2-min intro](https://img.shields.io/badge/▶_Watch-2--min_intro-FF0000?style=flat-square&logo=youtube&logoColor=white)](https://youtu.be/MclQ_RIofpE)
 [![good first issues](https://img.shields.io/github/issues/Aperivue/medsci-skills/good%20first%20issue?style=flat-square&label=good%20first%20issues&color=7057ff)](https://github.com/Aperivue/medsci-skills/contribute)
 [![Agent Skills](https://img.shields.io/badge/Agent_Skills-standard-blue?style=flat-square)](https://agentskills.io)
@@ -267,6 +268,12 @@ The E2E pipeline (`orchestrate --e2e`) produces everything up to `qc/`. The `sub
 ## What's New
+**v4.8** is the **review-harvest batch** — deterministic detector hardening promoted from real-manuscript review cycles. Additive and backward-compatible; still 45 skills / 36 guidelines, analysis-integrity detectors **30 → 32**:
+- **Two new gates** — `check_supplement_hygiene.py` lints the rendered supplement / tables / caption files (not just the manuscript) for §-labels, placeholders, build markers, response-letter framing, and unresolved body↔supplement cross-references; `check_null_calibration.py` flags a headline negative/equivalence claim made without a minimum-detectable-effect / power / equivalence statement.
+- **Four detector false-positive fixes** — gates no longer fire on a recommended colorblind-safe palette, author-footnote `§` daggers, a correctly-hedged disclaimer, or a tier-label digit; each with a regression fixture and three newly CI-wired test suites.
+- **Nine reviewer-side domain probes** (SR/MA, observational, diagnostic, AI-overclaiming, survival) plus a `/design-study` design-stage ceiling gate for perceptual/reader-AI studies and a reusable confidence-weighted-rating→AUC monotonicity template.
 **v4.7** is the **self-update foundation** — physician-researchers stay current without GitHub, git, or a terminal. Additive and backward-compatible; still 45 skills / 36 guidelines / 30 detectors:
 - **Transactional, crash-recoverable installer.** Each install runs through a durable journal state machine recovered on the next run (roll back / forward-clean / fail-closed), with per-target SHA-256 inventories — your modified or third-party skills are backed up and never clobbered or auto-deleted.

package/metadata/distribution_files.json CHANGED

@@ -358,8 +358,13 @@
     },
     {
       "path": "skills/analyze-stats/scripts/check_generated_code.py",
-      "size": 14525,
-      "sha256": "b9918728e4eac298fdb6adffa2a79faf835b395b65bace69409c190e88c13619"
+      "size": 15024,
+      "sha256": "b8643d9d9a4b600af428604811135fc27cee6eb2140bdae5e393ed9b3be37fae"
+    },
+    {
+      "path": "skills/analyze-stats/scripts/rating_monotonicity.py",
+      "size": 5494,
+      "sha256": "b7f8f36c7af060e2dec60f51185ef4cc5a0f2ef20c1d509fb845b36521f420b3"
     },
     {
       "path": "skills/analyze-stats/skill.yml",
@@ -868,8 +873,8 @@
     },
     {
       "path": "skills/design-study/SKILL.md",
-      "size": 15849,
-      "sha256": "41e461909bbf56f0cb810c6d0fe809380b70125911bbea014d9d7b8b397a13e9"
+      "size": 18282,
+      "sha256": "dfd73871102af6beab3d21f60f650db41ec2e4025d6ee19eae8de61b17851fd4"
     },
     {
       "path": "skills/design-study/skill.yml",
@@ -2363,8 +2368,8 @@
     },
     {
       "path": "skills/peer-review/SKILL.md",
-      "size": 50226,
-      "sha256": "48e0841a8fc2364c54964ae2346c13fb14978c87478ae1ed37964f728e65f02d"
+      "size": 50304,
+      "sha256": "2cabbeb0695e52e6a4aa1f341a0ddeb3cfe088c53e25c9803a2b1b5f3a0b5bd1"
     },
     {
       "path": "skills/peer-review/references/aczel_2021_reviewer2_patterns.md",
@@ -2373,8 +2378,8 @@
     },
     {
       "path": "skills/peer-review/references/domain-probes/ai_overclaiming.md",
-      "size": 12790,
-      "sha256": "cda5e4c56c37c37e036565f94d05487b8aa2d1c90b2a6a5b2e3113171456b6d7"
+      "size": 14205,
+      "sha256": "565aa362e20ea6ca923510ea393829fb851e0caaae3d732f99144dd48a25b951"
     },
     {
       "path": "skills/peer-review/references/domain-probes/case_report.md",
@@ -2388,8 +2393,8 @@
     },
     {
       "path": "skills/peer-review/references/domain-probes/diagnostic_accuracy.md",
-      "size": 7673,
-      "sha256": "292a3cefe72e55856084dc2d0ba7cd46ec60d8f52fd488ea57b72534f7ee39f7"
+      "size": 9002,
+      "sha256": "8232e2023d3c4d52e6a2d9003ae335302d6fe78c54d39b81f591c07becaf4df0"
     },
     {
       "path": "skills/peer-review/references/domain-probes/equity_fairness.md",
@@ -2408,8 +2413,8 @@
     },
     {
       "path": "skills/peer-review/references/domain-probes/observational_confounding.md",
-      "size": 24897,
-      "sha256": "8c21c2b0cde6d46dfdc9757571aeb480e0a692d30aea19d56b06654bbd3ba803"
+      "size": 27378,
+      "sha256": "bbced9817545c65992e7c3430057e32b9f32e092180bc91bd08730f03b2dcf9a"
     },
     {
       "path": "skills/peer-review/references/domain-probes/radiomics.md",
@@ -2423,13 +2428,13 @@
     },
     {
       "path": "skills/peer-review/references/domain-probes/sr_ma.md",
-      "size": 13933,
-      "sha256": "bcd74a02e89aec5c0436297b93181cd256f7f3a0d676843c214e330fa8d3b036"
+      "size": 17239,
+      "sha256": "486d569f559f16d62b882f7256ccfe7443e3e2b4cc72a407a5cd046162a98b25"
     },
     {
       "path": "skills/peer-review/references/domain-probes/survival_prognostic.md",
-      "size": 13073,
-      "sha256": "0a5922aa7ca389089de394b3fa569565600384ceabae0bd25bfd50efa901cd13"
+      "size": 13765,
+      "sha256": "af0e0323876424eec8ba07be22fe99b3100356e8e4895b5b79635e9a2d76e98e"
     },
     {
       "path": "skills/peer-review/references/exemplar_reviews/README.md",
@@ -2843,13 +2848,13 @@
     },
     {
       "path": "skills/self-review/SKILL.md",
-      "size": 84876,
-      "sha256": "e4d5cc5d0e75d8d13923cedf5d9260a759077f1ad86d68255f0f9323e8d11616"
+      "size": 89813,
+      "sha256": "54bcd9c6e751555044b9b436db9e8f10e0be280b60c00048c68de5570512bf16"
     },
     {
       "path": "skills/self-review/references/domain-probes/ai_overclaiming.md",
-      "size": 12790,
-      "sha256": "cda5e4c56c37c37e036565f94d05487b8aa2d1c90b2a6a5b2e3113171456b6d7"
+      "size": 14205,
+      "sha256": "565aa362e20ea6ca923510ea393829fb851e0caaae3d732f99144dd48a25b951"
     },
     {
       "path": "skills/self-review/references/domain-probes/case_report.md",
@@ -2863,8 +2868,8 @@
     },
     {
       "path": "skills/self-review/references/domain-probes/diagnostic_accuracy.md",
-      "size": 7673,
-      "sha256": "292a3cefe72e55856084dc2d0ba7cd46ec60d8f52fd488ea57b72534f7ee39f7"
+      "size": 9002,
+      "sha256": "8232e2023d3c4d52e6a2d9003ae335302d6fe78c54d39b81f591c07becaf4df0"
     },
     {
       "path": "skills/self-review/references/domain-probes/equity_fairness.md",
@@ -2883,8 +2888,8 @@
     },
     {
       "path": "skills/self-review/references/domain-probes/observational_confounding.md",
-      "size": 24897,
-      "sha256": "8c21c2b0cde6d46dfdc9757571aeb480e0a692d30aea19d56b06654bbd3ba803"
+      "size": 27378,
+      "sha256": "bbced9817545c65992e7c3430057e32b9f32e092180bc91bd08730f03b2dcf9a"
     },
     {
       "path": "skills/self-review/references/domain-probes/radiomics.md",
@@ -2898,13 +2903,13 @@
     },
     {
       "path": "skills/self-review/references/domain-probes/sr_ma.md",
-      "size": 13933,
-      "sha256": "bcd74a02e89aec5c0436297b93181cd256f7f3a0d676843c214e330fa8d3b036"
+      "size": 17239,
+      "sha256": "486d569f559f16d62b882f7256ccfe7443e3e2b4cc72a407a5cd046162a98b25"
     },
     {
       "path": "skills/self-review/references/domain-probes/survival_prognostic.md",
-      "size": 13073,
-      "sha256": "0a5922aa7ca389089de394b3fa569565600384ceabae0bd25bfd50efa901cd13"
+      "size": 13765,
+      "sha256": "af0e0323876424eec8ba07be22fe99b3100356e8e4895b5b79635e9a2d76e98e"
     },
     {
       "path": "skills/self-review/references/exemplar_findings/README.md",
@@ -2953,8 +2958,8 @@
     },
     {
       "path": "skills/self-review/scripts/check_artifact_coverage.py",
-      "size": 12440,
-      "sha256": "44a935c657bb597e3596c2e361c8d8ee89c613cb020715f27792cf1844731066"
+      "size": 17113,
+      "sha256": "56096c39ddb0083c04a1254f06bafa6fac9fc8a136c9246f68773f0ba5da96d4"
     },
     {
       "path": "skills/self-review/scripts/check_claim_artifact.py",
@@ -2963,19 +2968,24 @@
     },
     {
       "path": "skills/self-review/scripts/check_classical_style.py",
-      "size": 10067,
-      "sha256": "5cf7f9a331a4ebcb6f8523fcb5fd1b872079e3bb6e00bf9821a03d504174fdaa"
+      "size": 10953,
+      "sha256": "3b5c85edc57ee607a2b0a10898d68ac163ce3be6fe50b82c8c11490d7bc2705a"
     },
     {
       "path": "skills/self-review/scripts/check_cohort_arithmetic.py",
-      "size": 23363,
-      "sha256": "4ef7bded82091bc6cac8bf35cfa19bc1651acb7e0df2aa8b742bd7a04c8b3991"
+      "size": 23868,
+      "sha256": "55933d25d824bf073c0a8b7abc42d2f160c459a288ab932c50d63cf8c03afd37"
     },
     {
       "path": "skills/self-review/scripts/check_confounding_completeness.py",
       "size": 21506,
       "sha256": "7d3e67074d58a28ffee52ce64b486231f103a3ddcaf6b3b6ee83ba5f89c63bc2"
     },
+    {
+      "path": "skills/self-review/scripts/check_null_calibration.py",
+      "size": 7594,
+      "sha256": "9ddff01722c34efb6ffd757ae762c6ee12f5993bf13b11313c2e20b60b26cab3"
+    },
     {
       "path": "skills/self-review/scripts/check_panel_diversity.py",
       "size": 15324,
@@ -2998,8 +3008,13 @@
     },
     {
       "path": "skills/self-review/scripts/check_scope_coherence.py",
-      "size": 9538,
-      "sha256": "45915b2736f0356e1b2e827939090e5034a01ba5c310b952dd13ad6f2df45a61"
+      "size": 10818,
+      "sha256": "820dfc264c2a4f62c79c0c7123a3e1a8b59a100b89654617a08ff55deeb25a75"
+    },
+    {
+      "path": "skills/self-review/scripts/check_supplement_hygiene.py",
+      "size": 11088,
+      "sha256": "f89027472cdf0258357c3b0f0b0f3fec09b5ea65cc1373292797b818d1acf444"
     },
     {
       "path": "skills/self-review/skill.yml",

package/metadata/distribution_manifest.json CHANGED

@@ -1,6 +1,6 @@
 {
   "schema_version": 1,
-  "version": "4.7.0",
+  "version": "4.8.0",
   "owned_skills": [
     "academic-aio",
     "add-journal",

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "medsci-skills",
-  "version": "4.7.0",
+  "version": "4.8.0",
   "description": "MedSci Skills — a medical/scientific research skill suite for AI coding agents (Claude Code, Codex, Cursor, Copilot). The npm package is a terminal-friendly installer shortcut; the canonical distribution remains the GitHub repository and the Claude Code plugin marketplace.",
   "license": "SEE LICENSE IN LICENSE",
   "homepage": "https://github.com/Aperivue/medsci-skills#readme",

package/skills/analyze-stats/scripts/check_generated_code.py CHANGED

@@ -145,9 +145,15 @@ def check_text_common(src: str, lang: str) -> list[dict]:
     for m in NUM_LITERAL_BODY.finditer(src):
         body = m.group(1)
         # ignore obvious non-data: ranges, single repeated, function-call args with kwargs
-        nums = NUM_TOKEN.findall(body)
         if "=" in body:  # kwargs like figsize=(8,6) or linspace(0,1,...) — not table data
             continue
+        # A list/tuple of string literals (e.g. a hex-color palette
+        # ['#000000','#E69F00',...] — exactly the colorblind-safe WONG palette that
+        # make-figures recommends) is NOT hand-typed tabular data. Strip quoted
+        # substrings before counting numeric tokens, so digits living inside string
+        # literals (the "00" in '#E69F00', RGBA codes, category labels) don't make a
+        # string list look table-shaped. Genuine numeric data is unquoted.
+        nums = NUM_TOKEN.findall(STR_LITERAL.sub("", body))
         if len(nums) >= DATA_LITERAL_STANDALONE or (len(nums) >= DATA_LITERAL_MIN and has_read):
             ln = src[:m.start()].count("\n") + 1
             claims.append({
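
The effect of this hunk can be sketched in isolation. The real `NUM_TOKEN` and `STR_LITERAL` patterns are not shown in the diff, so the two regexes below are hypothetical stand-ins, and `count_numeric_tokens` is an illustrative helper rather than a function from the package. The sketch only shows why removing quoted substrings first keeps a hex-color palette from looking like hand-typed numeric data.

```python
import re

# Hypothetical stand-ins: the package's actual NUM_TOKEN / STR_LITERAL
# patterns are not visible in this diff.
NUM_TOKEN = re.compile(r"\d+(?:\.\d+)?")
STR_LITERAL = re.compile(r"'[^']*'" + "|" + r'"[^"]*"')


def count_numeric_tokens(body: str, strip_strings: bool = True) -> int:
    """Count numeric tokens in a list/tuple body (illustrative helper).

    With strip_strings=True, quoted substrings are removed before counting,
    mirroring the new `NUM_TOKEN.findall(STR_LITERAL.sub("", body))` line.
    """
    text = STR_LITERAL.sub("", body) if strip_strings else body
    return len(NUM_TOKEN.findall(text))


palette = "'#000000', '#E69F00', '#56B4E9', '#009E73'"
table = "12.5, 13.1, 9.8, 10.2"

print(count_numeric_tokens(palette, strip_strings=False))  # 8: digits inside hex codes
print(count_numeric_tokens(palette))                       # 0: strings stripped first
print(count_numeric_tokens(table))                         # 4: unquoted data still counted
```

Under these assumed patterns, the palette drops from eight apparent numeric tokens to zero, while the unquoted numeric list is counted as before. Whether either count crosses the detector's thresholds depends on `DATA_LITERAL_STANDALONE` and `DATA_LITERAL_MIN`, whose values are not shown here.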

package/skills/analyze-stats/scripts/rating_monotonicity.py ADDED

@@ -0,0 +1,132 @@
+#!/usr/bin/env python3
+"""Confidence-weighted rating → AUC monotonicity probe (reusable template).
+NOT an auto-discovered detector — it needs the score *encoding* as input, so it is
+a template/helper you point at your own score definition, not a manuscript scanner.
+When an observer / reader study collapses a (binary-call × confidence) rating into a
+single score used as the ROC/AUC predictor, that score MUST be strictly monotonic in
+"evidence for the positive label" across the full ladder:
+    negative-call, highest confidence   = strongest evidence AGAINST positive = LOWEST
+    …
+    negative-call, lowest confidence
+    positive-call, lowest confidence
+    …
+    positive-call, highest confidence    = strongest evidence FOR positive = HIGHEST
+A *folded* score — the classic bug `cws = confidence if positive_call else (K+1) − confidence`
+— makes negative/high-confidence collide with positive/low-confidence and is NOT
+monotonic; it understates discrimination and can flip a gradient to equivalence. Prose
+review cannot see this; re-checking the encoding does.
+Input JSON (`--encoding score_def.json`):
+    {
+      "confidence_levels": [1, 2, 3, 4, 5],
+      "scores": {
+        "positive": {"1": 6, "2": 7, "3": 8, "4": 9, "5": 10},
+        "negative": {"1": 5, "2": 4, "3": 3, "4": 2, "5": 1}
+      }
+    }
+`scores[call][confidence]` is the numeric value fed to the ROC/AUC routine for that cell.
+Exit codes: 0 monotonic, 1 a collision/inversion (folded or mis-encoded), 2 usage.
+Stdlib-only (json / argparse / sys). Run `--demo` for the 10-combination unit test.
+"""
+from __future__ import annotations
+import argparse
+import json
+import sys
+def evidence_order(levels: list) -> list[tuple[str, object]]:
+    """Cells ordered from strongest-against-positive to strongest-for-positive."""
+    asc = sorted(levels, key=lambda x: float(x))
+    # negative call: highest confidence first (strongest against) → lowest
+    neg = [("negative", c) for c in reversed(asc)]
+    # positive call: lowest confidence first → highest (strongest for)
+    pos = [("positive", c) for c in asc]
+    return neg + pos
+def check_encoding(spec: dict) -> dict:
+    levels = spec["confidence_levels"]
+    scores = spec["scores"]
+    order = evidence_order(levels)
+    seq = []
+    for call, conf in order:
+        # JSON object keys are strings; tolerate int/str confidence keys.
+        cell = scores[call]
+        val = cell.get(str(conf), cell.get(conf))
+        if val is None:
+            return {"ok": False, "problems": [f"missing score for {call}/conf {conf}"],
+                    "sequence": seq}
+        seq.append({"call": call, "confidence": conf, "score": val})
+    problems = []
+    for a, b in zip(seq, seq[1:]):
+        if b["score"] == a["score"]:
+            problems.append(
+                f"collision: {a['call']}/{a['confidence']} and {b['call']}/{b['confidence']} "
+                f"share score {a['score']} (a folded/mirrored encoding maps opposite evidence "
+                f"to the same value)")
+        elif b["score"] < a["score"]:
+            problems.append(
+                f"inversion: {a['call']}/{a['confidence']} (={a['score']}) ranks above "
+                f"{b['call']}/{b['confidence']} (={b['score']}) but carries weaker evidence "
+                f"for the positive label")
+    return {"ok": not problems, "problems": problems,
+            "sequence": [(s["call"], s["confidence"], s["score"]) for s in seq]}
+def _demo() -> int:
+    """The 10-combination (K=5) unit test: a correct directional encoding passes,
+    the folded encoding fails on collisions."""
+    levels = [1, 2, 3, 4, 5]
+    correct = {"confidence_levels": levels,
+               "scores": {"positive": {str(c): 5 + c for c in levels},
+                          "negative": {str(c): 6 - c for c in levels}}}
+    folded = {"confidence_levels": levels,
+              "scores": {"positive": {str(c): c for c in levels},
+                         "negative": {str(c): 6 - c for c in levels}}}
+    rc = check_encoding(correct)
+    rf = check_encoding(folded)
+    ok = rc["ok"] and not rf["ok"]
+    print(f"correct directional encoding: monotonic={rc['ok']} (expected True)")
+    print(f"folded encoding:              monotonic={rf['ok']} (expected False)")
+    if rf["problems"]:
+        print(f"  folded problems[0]: {rf['problems'][0]}")
+    print("DEMO PASS" if ok else "DEMO FAIL")
+    return 0 if ok else 1
+def main() -> int:
+    ap = argparse.ArgumentParser(description="Confidence-weighted rating→AUC monotonicity probe.")
+    ap.add_argument("--encoding", help="JSON score-definition file (see module docstring)")
+    ap.add_argument("--demo", action="store_true", help="run the 10-combination unit test")
+    args = ap.parse_args()
+    if args.demo:
+        return _demo()
+    if not args.encoding:
+        sys.stderr.write("ERROR: --encoding FILE or --demo is required\n")
+        return 2
+    try:
+        spec = json.loads(open(args.encoding, encoding="utf-8").read())
+    except (OSError, ValueError) as e:
+        sys.stderr.write(f"ERROR: cannot read encoding: {e}\n")
+        return 2
+    res = check_encoding(spec)
+    if res["ok"]:
+        print("OK: the (call × confidence) → score encoding is strictly monotonic.")
+        return 0
+    print("NON-MONOTONIC encoding (folded or mis-ordered) — this mis-estimates the AUC:")
+    for p in res["problems"]:
+        print(f"  - {p}")
+    return 1
+if __name__ == "__main__":
+    sys.exit(main())
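
The folded-versus-directional distinction described in the module docstring can be worked through by hand for K = 5. The sketch below is a standalone illustration and does not import the added script; the function names `folded`, `directional`, `ladder` and `is_strictly_increasing` are hypothetical and chosen only for this example. The score formulas are the ones given in the docstring and in `_demo`.

```python
K = 5
LEVELS = list(range(1, K + 1))


def folded(positive_call: bool, confidence: int) -> int:
    """The folded score quoted in the docstring: confidence, mirrored for negatives."""
    return confidence if positive_call else (K + 1) - confidence


def directional(positive_call: bool, confidence: int) -> int:
    """The directional score used by the demo's 'correct' case (5 + c / 6 - c)."""
    return K + confidence if positive_call else (K + 1) - confidence


def ladder(score) -> list[int]:
    """Scores listed from strongest evidence against positive to strongest for."""
    negative = [(False, c) for c in sorted(LEVELS, reverse=True)]
    positive = [(True, c) for c in LEVELS]
    return [score(call, conf) for call, conf in negative + positive]


def is_strictly_increasing(seq: list[int]) -> bool:
    return all(a < b for a, b in zip(seq, seq[1:]))


print(ladder(directional))                          # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(ladder(folded))                               # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(is_strictly_increasing(ladder(directional)))  # True
print(is_strictly_increasing(ladder(folded)))       # False
```

In the folded ladder every score appears twice: negative call at confidence 5 and positive call at confidence 1 both map to 1, and so on up the scale. Because those duplicates are not adjacent in evidence order, a check that compares only neighbouring cells first fails at the step from negative/1 (score 5) to positive/1 (score 1), where the score drops.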

package/skills/analyze-stats/tests/fixtures/gen_palette.py ADDED

@@ -0,0 +1,22 @@
+"""
+Analysis: synthetic CLEAN fixture exercising the colorblind-safe palette.
+A hex-color list must NOT be flagged HARDCODED_DATA_LITERAL even alongside a
+data-file read (regression for the WONG-palette false positive).
+Date: 2026-01-01
+Random seed: 42
+"""
+import numpy as np
+import pandas as pd
+np.random.seed(42)
+# The Wong (2011) colorblind-safe palette that make-figures recommends. Eight
+# string literals; the digits live inside the hex codes, not in tabular data.
+WONG = ["#000000", "#E69F00", "#56B4E9", "#009E73",
+        "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]
+# portable relative path; no hand-typed numeric data
+df = pd.read_csv("cohort.csv")
+means = df.groupby("group")["auc"].mean()
+print(WONG[0], float(means.iloc[0]))

package/skills/analyze-stats/tests/test_generated_code.sh CHANGED

@@ -55,5 +55,17 @@ check "exit 0 (clean .py)" test "$?" -eq 0
 python3 "$SCRIPT" --code-dir "$HERE/fixtures" --strict --quiet >/dev/null 2>&1
 check "exit 1 (--code-dir scan)" test "$?" -eq 1
+# (5) hex-color palette + data read -> NOT HARDCODED_DATA_LITERAL (WONG-palette
+#     false-positive regression); the script is otherwise clean -> exit 0
+PALETTE="$HERE/fixtures/gen_palette.py"
+python3 "$SCRIPT" "$PALETTE" --out "$OUT" --quiet >/dev/null 2>&1
+check "no HARDCODED_DATA_LITERAL on hex-color palette" python3 -c "
+import json
+d=json.load(open('$OUT'))
+assert not any(c['verdict']=='HARDCODED_DATA_LITERAL' for c in d['claims']), 'palette flagged as data literal'
+"
+python3 "$SCRIPT" "$PALETTE" --strict --quiet >/dev/null 2>&1
+check "exit 0 on clean palette script" test "$?" -eq 0
 echo "fail=$fail"; [[ "$fail" -eq 0 ]] && echo "ALL PASS" || echo "FAILURES: $fail"
 exit "$fail"

package/skills/design-study/SKILL.md CHANGED

@@ -195,6 +195,39 @@ For an AI-system-versus-human-expert benchmark specifically, route to `/design-a
 extends this subsection with arm definition, LLM-as-judge versus human-as-judge adjudication, and a
 structured export schema.
+**Perceptual / reader AI study — design-stage ceiling gate**
+For a reader/observer/perceptual or diagnostic-accuracy AI study (visual Turing test, AI-vs-human
+detection, image-provenance/deepfake, observer study), the acceptance ceiling is fixed **at design
+time, not at analysis time** — excellent execution cannot lift a ceiling baked into the comparator,
+the estimand, or the reader cohort. Walk these six before data lock and, for each, take the
+higher-ambition option or record an explicit, defensible reason not to (set each at the impact level
+of the journal you actually want):
+1. **Comparator realism (biggest lever).** A curated teaching-repository "authentic" arm scopes the
+   claim to "teaching-quality", not clinical. Use consecutive, de-identified clinical-acquisition
+   images (the real PACS spectrum), or add a clinical-spectrum validation arm.
+2. **Format / non-content confound matching.** Match every non-content attribute (aspect ratio,
+   resolution, compression, color profile) across arms by construction, and pre-specify a
+   confound-classifier ceiling check (format-only AUC must be ≪ reader AUC) as a *primary* gate.
+3. **Synthetic / index-arm denominator (survivorship).** Pre-specify how failed/low-quality
+   generations are counted; report the full generation denominator rather than evaluating only the
+   convincing survivors.
+4. **Reader independence and breadth.** Recruit an independent, non-author, multi-site (ideally
+   multi-national) reader cohort; collect reader characteristics; blind readers to the hypothesis
+   where feasible.
+5. **Estimand and power (generalize, don't condition).** Power the reader-AND-case generalization as
+   the **primary** estimand from the start, so the two-way interval — not a pool-conditional number —
+   supports the headline claim.
+6. **Novelty positioning vs scoop, and venue-fit.** Scan for close prior work at design time; if a
+   flagship precedent exists, make the differentiation categorical (new modality class, clinical
+   spectrum, outcome linkage), not incremental; pick the venue whose audience values the likely
+   result (a rigorous null fits a methodology-forward journal better than an impact-first one).
+The meta-rule: set the comparator, the confound-matching, the reader cohort, and the estimand at the
+target journal's impact level **before** data collection — do not plan to out-write a structural
+ceiling in revision.
 ### Phase 3: Clinical framing
 Ask whether the comparator and endpoint support the stated claim:

package/skills/peer-review/SKILL.md CHANGED

@@ -135,11 +135,11 @@ confidential note and the recommendation are consistent.
 ### Phase 2A: Systematic Review / Meta-Analysis Extension
-Apply this internal-consistency-first gate (P0) plus 10-probe checklist (P1–P10) **only when manuscript type is "Systematic Review", "Meta-Analysis", or "Systematic Review and Meta-Analysis"**. These probes complement (do not replace) the generic Phase 2 issue checklist.
+Apply this internal-consistency-first gate (P0) plus 17-probe checklist (P1–P17) **only when manuscript type is "Systematic Review", "Meta-Analysis", or "Systematic Review and Meta-Analysis"**. These probes complement (do not replace) the generic Phase 2 issue checklist.
 **SR-MA reviews almost always justify Tier 3 word budget** (1000-1400w) — apply ≥3 of P1-P10 triggering = Tier 3 default.
-**Probe detail (P0–P10), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/sr_ma.md`. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route conclusion-threatening or integrity findings into the Confidential Comments to the Editor, and place a confirmed error that drives a headline claim as the Major #1 candidate.
+**Probe detail (P0–P17), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/sr_ma.md`. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route conclusion-threatening or integrity findings into the Confidential Comments to the Editor, and place a confirmed error that drives a headline claim as the Major #1 candidate.
 ### Phase 2B: Survival / Prognostic Model Extension
@@ -182,14 +182,14 @@ The original-research probes (Phase 2 issue checklist, Phase 2A/2B/2C) do not tr
 ### Phase 2E: Observational / Confounding Extension
-Apply this 14-probe checklist (O1–O14) **only when the manuscript is an observational study** (cohort, case-control, cross-sectional, health-screening / registry) **whose central claim is an adjusted exposure–outcome association** estimated by covariate adjustment rather than randomization. These probes complement (do not replace) the generic Phase 2 issue checklist and the STROBE reporting items; they target the gap between the stated adjustment set and what the exposure-stratified Table 1 shows.
+Apply this 16-probe checklist (O1–O16) **only when the manuscript is an observational study** (cohort, case-control, cross-sectional, health-screening / registry) **whose central claim is an adjusted exposure–outcome association** estimated by covariate adjustment rather than randomization. These probes complement (do not replace) the generic Phase 2 issue checklist and the STROBE reporting items; they target the gap between the stated adjustment set and what the exposure-stratified Table 1 shows.
 **Exempt**:
 - Randomized trials (confounding controlled by design → Phase 2 + CONSORT)
 - Purely descriptive / prevalence reports with no adjusted association claim
 - Diagnostic-accuracy studies with no exposure–outcome estimand (→ Phase 2A DTA cells + categories A–C)
-**Probe detail (O1–O14), with output templates:** `${CLAUDE_SKILL_DIR}/references/domain-probes/observational_confounding.md`. Load it and apply each probe when the trigger above fires. O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set), O7 (an outcome consequence/mediator wrongly adjusted — the opposite-direction failure, e.g. serum uric acid in an eGFR model), and O8 (records > subjects with the analysis unit undisclosed) are data-checkable and the highest-yield probes — verify O1/O7 against the manuscript's own Table 1 and run the records-vs-subjects check for O8. In this skill, map each probe finding to the review draft as a Major / Minor comment; a confounding-completeness gap (O1), over-adjustment that moves the headline estimate (O7), a selection/collider structure (O3), undisclosed repeat-subject clustering (O8), an undisclosed complete-case collapse (O5), a report-derived outcome with no construct-validity defence (O9), an inferential effect-size gradient across overlapping/nested subsets with no difference/interaction test (O10), an ignored/mis-specified complex-survey design (O11, NHANES/KNHANES weights without strata+PSU, or a subgroup by row-deletion), a data-mined inflection-point/'saturation' cutoff (O12), a cross-sectional mediation claimed as a causal chain without a temporal-order caveat / M–Y-confounding sensitivity (O13), or a synergy/joint-effect claim on the wrong interaction scale — multiplicative-only or joint-category ORs with no additive RERI/AP/S (O14) — is design-level, so surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate rather than softening it to a reporting fix.
+**Probe detail (O1–O16), with output templates:** `${CLAUDE_SKILL_DIR}/references/domain-probes/observational_confounding.md`. Load it and apply each probe when the trigger above fires. O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set), O7 (an outcome consequence/mediator wrongly adjusted — the opposite-direction failure, e.g. serum uric acid in an eGFR model), and O8 (records > subjects with the analysis unit undisclosed) are data-checkable and the highest-yield probes — verify O1/O7 against the manuscript's own Table 1 and run the records-vs-subjects check for O8. In this skill, map each probe finding to the review draft as a Major / Minor comment; a confounding-completeness gap (O1), over-adjustment that moves the headline estimate (O7), a selection/collider structure (O3), undisclosed repeat-subject clustering (O8), an undisclosed complete-case collapse (O5), a report-derived outcome with no construct-validity defence (O9), an inferential effect-size gradient across overlapping/nested subsets with no difference/interaction test (O10), an ignored/mis-specified complex-survey design (O11, NHANES/KNHANES weights without strata+PSU, or a subgroup by row-deletion), a data-mined inflection-point/'saturation' cutoff (O12), a cross-sectional mediation claimed as a causal chain without a temporal-order caveat / M–Y-confounding sensitivity (O13), or a synergy/joint-effect claim on the wrong interaction scale — multiplicative-only or joint-category ORs with no additive RERI/AP/S (O14) — is design-level, so surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate rather than softening it to a reporting fix.
 ### Phase 2E-2: Clinical Prediction-Model Extension
@@ -201,7 +201,7 @@ Apply this 4-probe checklist (CP1–CP4) **only when the manuscript develops or
 Apply when an AI/ML **primary study** (diagnostic, prognostic, triage, detection) makes a clinical claim in the Title/Abstract/Conclusion — generalizable, outperforms clinicians, deployment-ready, can replace a reader. Complements Phase 2F (recommendation calibration) and the signature "Overclaiming vs evidence level" check; co-applies with Phase 2C for radiomics-AI and Phase 2B for prognostic-AI.
-**Probe detail (AO0–AO5), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/ai_overclaiming.md`. Load it and apply each probe when the trigger fires. Run AO0 first — locate the load-bearing claim and read it together with its cited evidence before alleging over-reach (a hedged Discussion qualifier is not a headline). In this skill, map each probe finding to the review draft as a Major / Minor comment; a headline generalizability (AO1), superiority/replacement (AO2/AO3), or deployment-readiness (AO4) claim that outruns the design is framing-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate when it is the paper's headline. AO5 catches over-reach in the reported metric itself (best-fold headline without cross-fold CI/SD, unstated/test-tuned operating point, rebalanced-accuracy, or a code-vs-claims mismatch); pair it with the `exemplar_reviews/optimistic_validation_reporting.md` phrasing model and raise it as Major when it carries the headline.
+**Probe detail (AO0–AO6), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/ai_overclaiming.md`. Load it and apply each probe when the trigger fires. Run AO0 first — locate the load-bearing claim and read it together with its cited evidence before alleging over-reach (a hedged Discussion qualifier is not a headline). In this skill, map each probe finding to the review draft as a Major / Minor comment; a headline generalizability (AO1), superiority/replacement (AO2/AO3), or deployment-readiness (AO4) claim that outruns the design is framing-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate when it is the paper's headline. AO5 catches over-reach in the reported metric itself (best-fold headline without cross-fold CI/SD, unstated/test-tuned operating point, rebalanced-accuracy, or a code-vs-claims mismatch); pair it with the `exemplar_reviews/optimistic_validation_reporting.md` phrasing model and raise it as Major when it carries the headline.
 ### Phase 2H: RCT / Intervention-Trial Extension
@@ -211,9 +211,9 @@ Apply this 8-probe checklist (RC0–RC7) **only when the manuscript is a randomi
 ### Phase 2I: Diagnostic-Accuracy / Reader-Study Extension
-Apply this 6-probe checklist (D1–D6) **only when the manuscript is a diagnostic test accuracy (DTA) primary study** — an index test against a reference standard — including **multi-reader multi-case (MRMC)** reader studies (AI-vs-reader or modality comparison). These probes complement (do not replace) the generic Phase 2 issue checklist and the STARD / QUADAS-2 items; they target verification/spectrum/blinding bias and the MRMC design/variance issues a reader study adds. (For a DTA **meta-analysis**, use Phase 2A / `sr_ma.md`.)
+Apply this 8-probe checklist (D1–D8) **only when the manuscript is a diagnostic test accuracy (DTA) primary study** — an index test against a reference standard — including **multi-reader multi-case (MRMC)** reader studies (AI-vs-reader or modality comparison). These probes complement (do not replace) the generic Phase 2 issue checklist and the STARD / QUADAS-2 items; they target verification/spectrum/blinding bias and the MRMC design/variance issues a reader study adds. (For a DTA **meta-analysis**, use Phase 2A / `sr_ma.md`.)
-**Probe detail (D1–D6), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/diagnostic_accuracy.md`. Load it and apply each probe when the trigger fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; two-gate (case-control) sampling (D2), verification/incorporation bias (D1), or an MRMC analysis that ignores reader variance (D6) is design/analysis-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate. Pairs the `analyze-stats` `table-types/reader_study.md` table and the `make-figures` `exemplar_plots/mrmc_roc.md` figure; a test-set-tuned operating threshold pairs with `exemplar_reviews/optimistic_validation_reporting.md`.
+**Probe detail (D1–D8), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/diagnostic_accuracy.md`. Load it and apply each probe when the trigger fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; two-gate (case-control) sampling (D2), verification/incorporation bias (D1), or an MRMC analysis that ignores reader variance (D6) is design/analysis-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate. Pairs the `analyze-stats` `table-types/reader_study.md` table and the `make-figures` `exemplar_plots/mrmc_roc.md` figure; a test-set-tuned operating threshold pairs with `exemplar_reviews/optimistic_validation_reporting.md`.
 ### Phase 2J: Case-Report Extension
@@ -336,7 +336,7 @@ After drafting, verify mechanically:
    - ≥50% of Minor requests use hedged forms ("I'd suggest," "could," "would help") rather than imperative ("must," bare "Please [verb]")
    - General Comments names ≥2 specific strengths before listing concerns
    - At most 1 typo/grammar Minor Comment, only if in formal section or systematic
-9. **SR-MA-specific QC** (if Phase 2A applied): Confirm the P0 internal-consistency gate was run before any fabrication claim. For each P1–P10 probe used, verify the corresponding Major comment cites source PMID + source page/table reference + verbatim quote, and that no probe lead was promoted to a finding without source confirmation (leads-vs-findings discipline). Reviews citing extraction errors without source-page reference are not actionable for authors.
+9. **SR-MA-specific QC** (if Phase 2A applied): Confirm the P0 internal-consistency gate was run before any fabrication claim. For each P1–P17 probe used, verify the corresponding Major comment cites source PMID + source page/table reference + verbatim quote, and that no probe lead was promoted to a finding without source confirmation (leads-vs-findings discipline). Reviews citing extraction errors without source-page reference are not actionable for authors.
 10. **Radiomics-reproducibility QC** (if Phase 2C applied): If an acquisition-parameter sweep predicts an outcome from its own grid axes (R1 design-grid circularity) or the substantive result is a cross-domain failure framed as success (R3), confirm the recommendation reflects design-level severity and is not softened to a reporting fix. Where a model × threshold/cohort grid yields a few p < 0.05, confirm the multiplicity / expected-false-positive count is named (R4), not deferred to "statistical review needed."
 11. **Review-article QC** (if Phase 2D applied): Confirm RV1–RV9 are reflected — in particular that novelty/value-add (RV1) is raised for a saturated topic and that gap-filling (RV8) is present, not just error-spotting. Verify SANRA is used as an appraisal aid, not over-enforced as a reporting guideline (no PRISMA demand on a narrative review; only RV3 is SANRA-aligned and phrased as a suggestion). Verify every suggested addition uses "consider adding" phrasing (no "must cite"), is source-confirmed, and that preprints are labeled as preprints (not equated with peer-reviewed guidelines). Confirm Phase 2F was run for the recommendation: when RV1 novelty is a Major in a saturated space with no distinct contribution, the recommendation is escalated toward Reject (the contribution IS the product — weak novelty is unfixable-in-current-form), not defaulted to the revision/Reconsider tier.
 12. **AI/method/review priority QC**: Before a Major Revision (or Reconsider) recommendation, confirm Phase 2F
@@ -406,7 +406,7 @@ For radiomic feature-reproducibility / phantom parameter-sweep / reliability-fil
 For Review / narrative / primer / state-of-the-art manuscripts, apply the Phase 2D 9-probe audit (novelty/value-add, scope/aims, evidence-gathering transparency, technical/medical accuracy, taxonomy/synthesis coherence, balance/currency/citation accuracy, load-bearing figures/tables, constructive gap-filling, curated-base circularity) in place of the original-research probes — error-spotting plus proportionate gap-filling, with SANRA used as an appraisal aid only.
-For observational studies whose central claim is an adjusted exposure–outcome association, also apply the Phase 2E 12-probe audit (confounding completeness, adjustment-set provenance, selection/collider bias, exposure measurement validity, missing-data / complete-case collapse, residual-confounding E-value, over-adjustment, analysis-unit/clustering, outcome construct validity, overlapping-subset gradient, complex-survey design & weighting, data-driven threshold mining, cross-sectional mediation, interaction scale), with O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set) and O7 (an outcome consequence/mediator wrongly adjusted) checked against the manuscript's own Table 1.
+For observational studies whose central claim is an adjusted exposure–outcome association, also apply the Phase 2E 16-probe audit (confounding completeness, adjustment-set provenance, selection/collider bias, exposure measurement validity, missing-data / complete-case collapse, residual-confounding E-value, over-adjustment, analysis-unit/clustering, outcome construct validity, overlapping-subset gradient, complex-survey design & weighting, data-driven threshold mining, cross-sectional mediation, interaction scale, selection on modality/procedure availability, serial-imaging lesion-tracking), with O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set) and O7 (an outcome consequence/mediator wrongly adjusted) checked against the manuscript's own Table 1.
 For cross-modality image-synthesis manuscripts (MRI→PET / MRI→CT / non-contrast→contrast / low-dose→full-dose) that claim functional/molecular information or a substitute for the unavailable target modality, also apply the Phase 2K 4-probe audit (IS1 determinism/information-ceiling vs a source→label baseline, IS2 target-derived-preprocessing/slice-selection leakage, IS3 global vs lesion-level quantitative agreement, IS4 mechanistic/proxy-signal plausibility); IS2 and IS4 are typically unfixable-in-current-form and govern the recommendation per Phase 2F.
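
The mechanical QC item above (at least 50% of Minor requests in hedged form) can be sketched as a small script. This is a minimal illustration, not part of the package: the function name `hedged_share` and the marker lists are hypothetical, seeded only from the examples quoted in the QC item, and simple substring matching is an assumption that will miss other hedged phrasings.

```python
import re

# Hypothetical marker lists, seeded from the examples quoted in the QC item.
HEDGED = ("i'd suggest", "could", "would help")
IMPERATIVE = re.compile(r"\bmust\b|^\s*please\s+\w+", re.IGNORECASE)


def hedged_share(minor_comments):
    """Return the fraction of Minor comments that use a hedged form."""
    if not minor_comments:
        return 1.0  # nothing to check, so the threshold is trivially met
    hedged = 0
    for text in minor_comments:
        lowered = text.lower()
        is_hedged = any(marker in lowered for marker in HEDGED)
        # A comment counts as hedged only if it is not also imperative.
        if is_hedged and not IMPERATIVE.search(text):
            hedged += 1
    return hedged / len(minor_comments)


# Illustrative Minor comments (invented for this sketch).
minors = [
    "I'd suggest reporting the cross-fold SD alongside the mean AUC.",
    "The authors must add a calibration plot.",
    "A per-reader table would help readers judge variability.",
    "Please clarify the reading order.",
]
share = hedged_share(minors)
print(f"hedged share = {share:.2f}, passes = {share >= 0.5}")
```

With these four invented comments, two count as hedged, so the share sits exactly at the 50% threshold. A real check would likely need a broader phrase list and a human pass over borderline wording.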

package/skills/peer-review/references/domain-probes/ai_overclaiming.md CHANGED

@@ -6,7 +6,7 @@
        - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
      Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
-# AI / ML overclaiming probes (AO0–AO5)
+# AI / ML overclaiming probes (AO0–AO6)
 A 5-probe checklist (AO1–AO5, with AO0 as a gate) for medical-AI/ML primary studies (diagnostic, prognostic, triage, detection) where the **conclusion's reach exceeds the evidence**. These probes complement (do not replace) the generic Phase 2 issue checklist and the signature "Overclaiming vs evidence level" check. The aim is to keep a framing-level over-reach from passing as a wording nitpick: a paper can report sound metrics yet draw a clinical claim — generalizable, outperforms clinicians, deployment-ready — that the design does not support, and that claim is what a reader carries away. AO1–AO4 target over-reach in the *claim sentences*; AO5 targets over-reach baked into the *reported metric itself* (an optimistically- or unreproducibly-reported number that makes the result look stronger than a faithful estimate). Run AO0 first.
@@ -43,6 +43,12 @@ A 5-probe checklist (AO1–AO5, with AO0 as a gate) for medical-AI/ML primary st
 - (d) **Code-vs-claims fidelity**: where code is released, does the described tuning/metric match it? Common mismatches: a claimed hyperparameter search the code does not run; a metric (e.g., specificity) attributed to a library function that does not compute it. A confirmed mismatch is an integrity/reproducibility flag — verify against the released code before asserting it.
 - Severity: MAJOR when the load-bearing performance claim rests on a best-fold number, an unstated/test-tuned threshold, a rebalanced-accuracy headline, or a code-vs-claims mismatch (the reported result is optimistic or not reproducible); MINOR when cross-validation was sound and only the cross-fold summary, the operating point, or a class-aware metric is missing from the write-up.
+**AO6 — Arm-defining task vs deployment workflow (construct validity of the evaluation)**:
+- Distinct from AO3 (model-task ≠ human-task *framing*) and from scope-coherence (claim ≠ result): AO6 asks whether the **task that defines the study arms mirrors the deployment workflow the claim targets**, or an artificial handicap/selection. Two recurrent failure modes:
+  - (a) **Handicapped arm** — the AI (or comparator) arm is operationalized in a way the real workflow never imposes: e.g., AI read in pure blind interpretation while the actual deployment provides clinical context / priors / the report, so the evaluation measures a task no one performs.
+  - (b) **Success-conditioned selection** — the arm or the analyzed subset is gated on an AI-success condition (cases where the model produced an output, segmentations that "passed", studies the pipeline did not fail on), so the comparison is conditioned on the very thing under test.
+- This is a **design/paradigm-level** defect: the operationalized task, not the prose, is mis-specified, so it cannot be fixed by rewording the claim — escalate **past an ordinary Major** (editors read it as a Reject-grade construct-validity failure; a panel that files it as a fixable Major under-rates it). The fix is a re-designed arm whose task matches the intended deployment workflow and an unconditioned (consecutive / intention-to-diagnose) analysis set.
 ## Decision-impact / early-deployment probes (DECIDE-AI axis, DI1–DI5)
 Co-apply when a study claims **clinical utility, deployment, or decision impact** of an AI system, or *is* an early-stage live clinical evaluation. The reporting axis is then DECIDE-AI (early-stage clinical evaluation of AI decision-support); these probes check that a utility/deployment claim rests on real-use evidence, not retrospective accuracy. They sharpen AO4 for the deployment-evaluation case.
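
To make the AO6(b) concern concrete, the sketch below contrasts sensitivity on a success-conditioned analysis set with sensitivity on an unconditioned (intention-to-diagnose) set. Everything here is an assumption for illustration: the `Case` record, the `sensitivity` helper, the toy counts, and the convention that a pipeline failure on a diseased case counts as a miss.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    diseased: bool               # reference-standard result
    ai_positive: Optional[bool]  # None when the pipeline produced no output


def sensitivity(cases, count_failures_as_missed):
    """Sensitivity among diseased cases, with or without pipeline failures."""
    diseased = [c for c in cases if c.diseased]
    if not count_failures_as_missed:
        # Success-conditioned set: drop cases where the AI gave no output.
        diseased = [c for c in diseased if c.ai_positive is not None]
    if not diseased:
        raise ValueError("no diseased cases in the analysis set")
    detected = sum(1 for c in diseased if c.ai_positive is True)
    return detected / len(diseased)


# Toy data (invented): 10 diseased cases, of which 6 are detected,
# 2 are missed, and 2 never yield an AI output; plus 5 non-diseased cases.
cases = (
    [Case(True, True)] * 6
    + [Case(True, False)] * 2
    + [Case(True, None)] * 2
    + [Case(False, False)] * 5
)

conditioned = sensitivity(cases, count_failures_as_missed=False)
unconditioned = sensitivity(cases, count_failures_as_missed=True)
print(f"success-conditioned: {conditioned:.2f}")
print(f"intention-to-diagnose: {unconditioned:.2f}")
```

In this toy set the conditioned estimate is 0.75 against 0.60 for the unconditioned one. The gap comes entirely from gating the analysis on the AI having produced an output, which is the selection that AO6(b) asks the reviewer to look for.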

package/skills/peer-review/references/domain-probes/diagnostic_accuracy.md CHANGED

@@ -6,7 +6,7 @@
        - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
      Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
-# Diagnostic-accuracy / reader-study probes (D1–D6)
+# Diagnostic-accuracy / reader-study probes (D1–D8)
 A checklist for **diagnostic test accuracy (DTA) primary studies** — an index test against a reference standard, including **multi-reader multi-case (MRMC)** reader studies (e.g., AI-vs-reader or modality-comparison). These probes complement (do not replace) the generic Phase 2 issue checklist and the STARD / QUADAS-2 items; they target the biases QUADAS-2 names and the MRMC design/variance issues a reader study adds. Pairs the `analyze-stats` `table-standards/table-types/reader_study.md` table and the `make-figures` `exemplar_plots/mrmc_roc.md` figure. (For a DTA **meta-analysis**, use `sr_ma.md`.)
@@ -38,9 +38,14 @@ A checklist for **diagnostic test accuracy (DTA) primary studies** — an index
 - Typical signatures: "patients were included if [index score] ≥ T" in a paper whose aim is to evaluate that index; enrolling on a positive screening test to then "validate" the screening test; defining the diseased group by the same reader/algorithm output under study.
 - This is **design-level, not a reporting fix** — escalate past an ordinary Major (a co-reviewer / editor reads it as a fatal selection/spectrum artifact). The fix is a reference standard and an enrollment criterion that are independent of the index test (a consecutive suspected-disease series), not a re-analysis.
+**D8 — Exclusion flow-diagram ↔ Methods-prose consistency + modality-safety enumeration**:
+- Cross-check the **exclusion criteria drawn in the participant flow diagram** (STROBE/STARD flow) against the **exclusion list in the Methods prose**. A criterion that appears in one but not the other (a count in the flow with no prose rationale, or a prose exclusion not reflected in the flow boxes) is a reporting inconsistency a co-reviewer catches by reading the two side by side.
+- For an **imaging** study, check that **modality-specific safety contraindications** and **device/artifact exclusions** are enumerated where applicable: MR safety (pacemaker/implant, claustrophobia), iodinated/gadolinium **contrast** contraindications (renal function, allergy, pregnancy), and **image-quality/artifact** exclusions (motion, metal artifact, incomplete coverage). Silent omission of these categories in a prospective imaging cohort understates the selected spectrum.
+- Severity: a flow-vs-prose exclusion mismatch is MAJOR when it changes the analytic-N or the eligible spectrum; missing modality-safety/artifact exclusion categories is MINOR–MAJOR depending on how much of the source population they remove. The fix is a reconciled exclusion list (flow == prose) plus an explicit modality-safety/artifact exclusion enumeration.
 **Output template (D2 / D6 example)**:
 > "The study uses a case-control (two-gate) design — confirmed cases versus healthy controls — rather than a consecutive series of patients in whom the diagnosis was suspected. This typically overestimates accuracy and does not reflect the intended-use spectrum, so I'd read the reported sensitivity/specificity as proof-of-concept rather than clinical accuracy, and suggest tempering the Abstract accordingly. Separately, the reader study reports a single reader-averaged AUC; because readers are a sample, I'd suggest an MRMC analysis (e.g., Obuchowski–Rockette) that accounts for both reader and case variance, with per-reader estimates shown and the unit of analysis (per-patient vs per-lesion) stated."
-**Discipline — leads vs findings (applies to D1–D6)**:
+**Discipline — leads vs findings (applies to D1–D8)**:
 - A verification/blinding/spectrum concern from a quick scan is a **lead until Methods and the participant flow are read together** — distinguish under-reporting (ask to clarify) from a true design bias (MAJOR).
 - Anchor each comment to the exact bias (partial vs differential verification; single- vs two-gate; reader-averaged vs fixed-reader; per-patient vs per-lesion) and the location. Keep severity tied to what the flaw does: two-gate sampling, incorporation bias, or ignoring reader variance is design/analysis-level (MAJOR, often Major #1); an unreported reading-order detail is a clarify-request.
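
The D8 flow-versus-prose cross-check amounts to a set comparison once the exclusion criteria have been transcribed by hand from both places. The sketch below assumes that transcription has already been done and that the same criterion is written with the same label in both lists. The function name `reconcile_exclusions` and the example criteria are hypothetical.

```python
def reconcile_exclusions(flow_diagram, methods_prose):
    """List exclusion criteria that appear in only one of the two sources."""
    # Normalize case and surrounding whitespace before comparing labels.
    flow = {c.strip().lower() for c in flow_diagram}
    prose = {c.strip().lower() for c in methods_prose}
    return {
        "flow_only": sorted(flow - prose),
        "prose_only": sorted(prose - flow),
    }


# Illustrative criteria (invented for this sketch).
flow_boxes = ["Motion artifact", "Pacemaker/implant", "Incomplete coverage"]
methods_list = ["motion artifact", "pacemaker/implant", "contrast allergy"]

mismatch = reconcile_exclusions(flow_boxes, methods_list)
print(mismatch)
```

Any entry under `flow_only` or `prose_only` is only a lead in the sense used above: it still needs to be read against the Methods and the flow diagram together before it is raised as a comment, since the two may simply use different wording for the same criterion.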