npm - medsci-skills - Versions diffs - 4.1.0 - Mend

medsci-skills 4.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (702)

package/skills/self-review/references/domain-probes/ai_overclaiming.md ADDED

@@ -0,0 +1,47 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a task- or
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# AI / ML overclaiming probes (AO0–AO4)
+A 4-probe checklist for medical-AI/ML primary studies (diagnostic, prognostic, triage, detection) where the **conclusion's reach exceeds the evidence**. These probes complement (do not replace) the generic Phase 2 issue checklist and the signature "Overclaiming vs evidence level" check. The aim is to keep a framing-level over-reach from passing as a wording nitpick: a paper can report sound metrics yet draw a clinical claim — generalizable, outperforms clinicians, deployment-ready — that the design does not support, and that claim is what a reader carries away. Run AO0 first.
+**AO0 — Locate the strongest claim, then its support (run before AO1; gates any over-reach finding)**:
+- Identify the load-bearing claims in the Title, Abstract, and Conclusion (the sentences a reader quotes). For each, find the specific evidence cited (which dataset, which comparison, which metric + uncertainty).
+- An over-reach finding is a **lead until the claim and its support are read together against the manuscript** — do not strawman a stray adjective. Escalate only when a headline claim genuinely outruns the cited evidence.
+- If the claim is already appropriately hedged to the evidence, record "claim matched to evidence" and move on.
+**AO1 — Generalizability claimed from limited external validation**:
+- Does the Abstract/Conclusion assert the model "generalizes," is "transferable/robust across settings," or is suitable for broad populations, while external validation is a single site / single scanner-vendor / single source (or absent)?
+- Sub-check: is the external set demographically narrow (single ethnicity, single sex-dominant, narrow age) relative to the population the claim names?
+- If the generalizability claim outruns the external evidence → recommend softening to the evidence ("validated at one external site") and moving multi-setting generalizability to a stated limitation + next step. MAJOR candidate when it is a headline claim; MINOR when it is a single qualifier in the Discussion.
+**AO2 — Superiority language against overlapping or under-powered comparison**:
+- Flag "outperforms", "superior to", "beats", "can replace [clinician/radiologist]" when (a) the model vs comparator 95% CIs overlap, (b) no test of the *difference* is reported (two separate AUCs are not a comparison), or (c) the comparison rests on a small test set / few readers.
+- Ask for the difference in the metric with its CI and a paired test of that difference, not two standalone estimates.
+- If the difference is not statistically supported → recommend reframing from "outperforms" to "comparable to" (still a meaningful result). MAJOR when a superiority/replacement claim is the headline; otherwise MINOR.
+**AO3 — Comparison-frame mismatch (model task ≠ human task)**:
+- When a model-vs-clinician comparison drives a claim, verify the two performed the **same task on the same inputs under the same constraints**: same images/inputs available, same time budget, same question asked, same decision point.
+- Common mismatches: the model sees a curated single view while readers see the full study; readers are timed or work from a different modality; the "reader" benchmark is a literature value on a different cohort.
+- A mismatch makes "outperforms clinicians" non-interpretable as a clinical claim → ask the authors to state exactly which task the comparison establishes, or to align the conditions. MAJOR candidate when it underpins a headline.
+**AO4 — Deployment / clinical-readiness claim from retrospective internal evidence**:
+- Flag "ready for clinical deployment", "can be used to triage/guide treatment", "will reduce workload/cost", or a recommended decision threshold, when the evidence is a retrospective, internally-split (or even external but observational) accuracy study with no prospective, silent-trial, or decision-impact data and (often) no calibration or decision-curve analysis.
+- Discrimination on retrospective data does not establish that acting on the model helps patients; a probability that drives a decision must also be calibrated, and net benefit must be shown.
+- Recommend reframing deployment/utility language to "supports further prospective evaluation", and (where a threshold is proposed) adding calibration + decision-curve evidence. MAJOR when a deployment/care-directive claim is made; MINOR when only a hedged "potential utility" sentence.
+**Output template (AO1 example)**:
+> "The Conclusion states the model 'generalizes across institutions,' but external validation appears limited to a single site ([Methods, External validation]). I'd suggest softening this to the evidence — e.g., 'validated at one external site' — and framing multi-institution generalizability as a stated limitation and a next step. If a broader claim is intended, an external set spanning multiple sites/vendors would be needed to support it."
+**Output template (AO2 / AO3 example)**:
+> "The 'outperforms radiologists' claim rests on a comparison whose 95% CIs for model and reader [metric] overlap ([Figure/Table]), and no test of the difference is reported; the reader task also differs from the model's in [inputs/time] ([Methods/Table]). I'd suggest (a) reporting the difference in [metric] with its CI and a paired test rather than two separate estimates, and (b) stating explicitly which clinical task the comparison establishes. If the difference is not statistically supported, reframing from 'outperforms' to 'comparable to' would be both defensible and still a meaningful result."
+**Discipline — leads vs findings (applies to AO0–AO4)**:
+- A claim-vs-evidence mismatch surfaced by a quick scan is a **lead, not a finding, until the claim sentence and its cited support are read together** against the manuscript. Do not escalate a hedged Discussion qualifier as if it were a headline.
+- Anchor every over-reach comment to the exact claim location and the exact evidence (dataset, comparison, metric + CI). A comment that names the location and the gap is actionable; "the authors overclaim" is not.
+- Keep severity tied to *where* the claim sits and *what it drives*: a headline/clinical-action claim that outruns the design is design-/framing-level (MAJOR, often Major #1); a stray adjective is MINOR.
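
As an illustration of the AO2 request for "the difference in the metric with its CI" (a sketch only, not part of the diffed package): the Python below computes a paired percentile-bootstrap interval for the accuracy difference between a model and a reader scored on the same cases. The function name `paired_bootstrap_diff`, the 2,000 resamples, the fixed seed, and the toy data are assumptions made for the example, and accuracy stands in for whatever metric a manuscript actually reports.

```python
import random


def paired_bootstrap_diff(model_correct, reader_correct, n_boot=2000, seed=0):
    """Return (observed difference, lower, upper) for model minus reader accuracy.

    Both inputs are per-case correctness flags on the SAME cases, in the same order.
    The interval is a 95% percentile bootstrap over cases (hypothetical helper).
    """
    if len(model_correct) != len(reader_correct):
        raise ValueError("paired design: both lists must cover the same cases")
    n = len(model_correct)
    # Per-case difference preserves the pairing:
    # +1 = only the model was right, -1 = only the reader was right, 0 = tie.
    deltas = [int(m) - int(r) for m, r in zip(model_correct, reader_correct)]
    observed = sum(deltas) / n

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    boots = []
    for _ in range(n_boot):
        # Resample whole cases with replacement, keeping each pair together.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        boots.append(sum(sample) / n)
    boots.sort()
    lower = boots[int(0.025 * n_boot)]
    upper = boots[int(0.975 * n_boot) - 1]
    return observed, lower, upper


# Toy data (invented for illustration): 1 = correct, 0 = incorrect.
model = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
reader = [1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
diff, lo, hi = paired_bootstrap_diff(model, reader)
print(f"accuracy difference = {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

If the interval for the difference includes zero, the data do not support "outperforms"; two standalone estimates with separate intervals cannot show this, because they ignore the pairing.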

package/skills/self-review/references/domain-probes/narrative_review.md ADDED

@@ -0,0 +1,44 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a task- or
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# Narrative / Review-Article probes (RV1–RV8)
+An 8-probe checklist for a Review / narrative review / primer / state-of-the-art / educational review — i.e., a non-systematic synthesis rather than original research. Supporting appraisal material (the SANRA appraisal items, a consolidated evaluation checklist, and a candidate-additions catalog for AI/LLM-in-radiology reviews) is maintained separately by the peer-review skill and is not required to apply RV1–RV8 below.
+The original-research probes (the generic Phase 2 issue checklist, and the SR-MA / Survival / Radiomics probes) do not transfer to review articles. The key inversion: for original research, reviewers are discouraged from scope-expanding requests, but **for narrative reviews, identifying thematic gaps and proportionately suggesting missing content is an expected part of the reviewer's role** — error-spotting alone is necessary but not sufficient. Keep SANRA in its lane: it is a 6-item *critical appraisal tool, not a reporting guideline*, so do not over-enforce it (only RV3 is SANRA-aligned, and as a suggestion; do not demand PRISMA — narrative ≠ systematic).
+**RV1 — Novelty & value-add** *(editorial value-add axis)*: Against ≥2–3 recent reviews/primers on the same topic, does the manuscript state explicitly what it adds? For saturated topics, if the authors do not position their contribution against the current review literature, the incremental value is hard to judge — MAJOR candidate. Judge contribution magnitude only; scope-fit is the editor's call.
+**RV2 — Scope & aims clarity** (SANRA items 1–2): Is the topic's importance established, and are the review's aims and scope boundaries (what is included/excluded) explicit?
+**RV3 — Evidence-gathering transparency** *(SANRA item 3, suggestion-level)*: Even a narrative review benefits from one paragraph on how the literature was identified (databases, time window, selection logic). This is **not a reject criterion** — phrase it as a SANRA-aligned transparency suggestion. Do not require PRISMA.
+**RV4 — Technical & medical accuracy** *(reviewer niche strength)*: Engineering correctness (autoregressive decoding, RAG, RLHF, instruction tuning, hallucination mechanisms, evaluation/mitigation methods) and medical correctness (radiology claims, clinical examples, anatomy/imaging detail). Itemize errors with location. This axis is where a domain-literate reviewer adds unique value.
+> **Verify-your-own-criticism gate**: before raising a technical inaccuracy or a citation–claim mismatch as a major finding, cross-check the assertion against a current authoritative source (the full cited paper, CrossRef, arXiv). Fast-moving fields make critiques go stale: a method dismissed as "not applicable" may have been adapted, and a "preprint" may since have been peer-reviewed. If unverified, downgrade to a hedged "Please verify…"; if confirmed, state it firmly. This applies with extra force to claims about what a cited reference *argues* (a review about hallucination must not itself mis-attribute a source).
+**RV5 — Taxonomy / synthesis coherence**: Is the manuscript's classification mutually exclusive and collectively exhaustive, and does it map to established taxonomies (intrinsic vs extrinsic; faithfulness vs factuality; published hallucination surveys)? Ad-hoc categories should be reconciled with an established taxonomy. Is the synthesis integrative rather than a list?
+**RV6 — Balance, currency, citation accuracy** (SANRA items 4–5): Is conflicting evidence handled fairly (no cherry-picking)? Are citations current and primary-source-weighted? Spot-check citation accuracy (author/year/claim match) — for a review *about* hallucination, citation errors are thematically critical.
+**RV7 — Load-bearing figures/tables** *(editorial value-add axis; SANRA item 6 secondary)*: Are there standardized comparison tables, a landscape figure, or a concrete clinical worked example? Assess whether figures/tables carry synthesis weight or are decorative — strong radiology-AI reviews tend to use standardized comparison matrices and a worked example.
+**RV8 — Constructive gap-filling & additions** *(the expected-role probe)*: Identify missing topics/frameworks/key references and propose them as **"consider adding X because it directly supports Y"** — never "must cite." Tier candidates by publication status:
+- *Peer-reviewed guidelines*: TRIPOD-LLM, MI-CLAIM-GEN, and STARD-AI (all Nature Medicine), and the CLAIM 2024 update (Radiology: AI)
+- *Preprint (label as such)*: any not-yet-peer-reviewed arXiv/medRxiv item — name it as a preprint and do not place it at the same level as peer-reviewed guidelines. Verify status before citing, since preprints are frequently published later (a checklist first posted to arXiv may since have appeared in a journal)
+- *Concepts/tools*: RAG specifics (retrieval failure vs fabrication), uncertainty/confidence calibration, radiology-specific evaluation (RadGraph, CheXbert/CheXpert-F1, ReXTrust), regulatory context (FDA 510(k)/CE, RADAR)
+Keep additions **proportionate** (≈ ≤1 new reference per page, each motivated; no wholesale rewrite). Suggesting missing *literature/topics* is expected; demanding new *studies* is not.
+**Output template (RV1 example)**:
+> "The topic of LLM hallucinations is now addressed by several recent reviews, so it would strengthen the manuscript to state explicitly what this primer adds beyond them — for example, a radiology-specific failure taxonomy, a worked clinical example, or an actionable verification workflow that existing general-purpose reviews do not provide. As written, the Introduction does not position the contribution against the current review literature, which makes the incremental value difficult to judge."
+**Output template (RV8 example)**:
+> "The mitigation section would benefit from engaging with emerging reporting standards for generative models, as these directly support the manuscript's call for controlled deployment. Consider adding a brief discussion of TRIPOD-LLM and MI-CLAIM-GEN (both peer-reviewed reporting guidelines for LLM/generative studies), and clarifying how retrieval-augmented generation shifts the dominant failure mode from fabrication toward retrieval error rather than eliminating hallucination, a distinction the current text conflates."
+This module gives review/narrative manuscripts a dedicated audit gate, on the principle that constructive gap-filling is an expected part of appraising a review article.

package/skills/self-review/references/domain-probes/observational_confounding.md ADDED

@@ -0,0 +1,48 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a confounding /
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# Observational / Confounding probes (O1–O6)
+A 6-probe checklist for observational studies (cohort, case-control, cross-sectional, health-screening registry) where the central claim is an exposure–outcome association estimated by adjustment rather than randomization. These probes complement (do not replace) the generic Phase 2 issue checklist and the STROBE reporting items; they target the gap between what a manuscript *says* it adjusted for and what the exposure-stratified data show. O1 is data-checkable and the highest-yield probe — the self-review skill automates it as a deterministic gate (Phase 2.5e, `scripts/check_confounding_completeness.py`) that reads the exposure-stratified Table 1 and the Methods adjustment set.
+**O1 — Confounding completeness (measured-but-unadjusted)**:
+- Does the exposure-stratified baseline table (Table 1 by exposure) show covariates that are **significantly imbalanced** across exposure groups (p < 0.05, or a standardized mean difference > 0.1) yet are **absent from the adjustment set**?
+- A covariate that was measured, is imbalanced by exposure, and is a plausible cause of the outcome is residual confounding by a *measured* variable — the most preventable kind. Common offenders in metabolic / screening cohorts: smoking pack-years, uric acid, HDL, total cholesterol, HbA1c, eGFR.
+- Is the adjustment set justified (DAG, prior literature, or a pre-specified plan), or is it a short default (age/sex/BMI + a few comorbidities) that silently omits imbalanced labs?
+- Measured-but-unadjusted imbalanced covariate(s) → MAJOR. Recommend an extended-adjustment sensitivity model that adds the omitted covariates and reports whether the primary estimate is robust; the original model stays primary only if the extended model agrees.
+**O2 — Adjustment-set provenance (DAG vs Table-1-stepwise)**:
+- Was the adjustment set chosen by a causal structure (DAG / explicit confounder reasoning) or by a data-driven "include if Table 1 p < 0.05" / stepwise rule?
+- Data-driven selection risks both directions: **over-adjustment** for mediators or colliders (a variable on the causal path, or a common effect of exposure and outcome, biases the estimate) and **under-adjustment** for a confounder that happens to be balanced in this sample.
+- No stated rationale for inclusion/exclusion of each adjustment variable → MAJOR (the same model can be confounded and over-adjusted at once).
+**O3 — Selection / collider bias at enrollment**:
+- Is the cohort a self-selected or conditioned sample (health-screening attendees, survivors, a registry conditioned on having had the index test) such that enrollment is a collider opening a backdoor path?
+- Are index-event bias (conditioning on a first event), immortal-time bias (exposure defined over a window during which subjects must survive), and prevalent-user bias addressed?
+- Unaddressed selection/collider structure that could generate the reported association → MAJOR; at minimum require an explicit selection-bias paragraph and, where possible, a sensitivity analysis.
+**O4 — Exposure measurement validity**:
+- Is the exposure a validated/quantitative measure or an unvalidated binary flag (e.g., a single reader's visual call, an ICD code, a self-report) with no in-cohort reliability (κ / ICC) and no severity gradient?
+- Structural-zero dose covariates: a dose/duration variable anchored to a categorical exposure (never-smoker → pack-years = 0, never-drinker → grams = 0) must be treated as a structural zero, not missing — misclassification here both mismeasures the exposure and (O5) collapses the analytic sample.
+- Non-differential misclassification biases toward the null (an underpowered null is not reassurance); differential misclassification can bias either way. Binary/unvalidated exposure with no reliability estimate → MAJOR (or a prominent limitation with a quantitative bias argument).
+**O5 — Missing-data mechanism & complete-case collapse**:
+- Is the missing-data mechanism (MCAR / MAR / MNAR) stated and justified, with the missingness fraction per key variable reported (ideally by exposure stratum)?
+- Does a dose/duration covariate (pack-years, cessation duration, alcohol grams) entering a complete-case multivariable model collapse n in the unexposed stratum (structural zeros dropped as missing), distorting subgroup estimates? Report n before and after model fitting.
+- If multiple imputation is used, are the mechanism assumption, the number of imputations, the imputation model, and a seed reported, and are structural zeros kept out of the imputation? Unjustified MAR for a large missing fraction, or an undisclosed complete-case collapse → MAJOR.
+**O6 — Residual confounding quantification (E-value)**:
+- Is an E-value (or a comparable quantitative-bias / negative-control analysis) reported for the **primary** estimate and its confidence limit, so a reader can judge how strong an unmeasured confounder would need to be to explain the association?
+- An E-value computed for a non-primary, supporting estimate but quoted as if it bounds the primary claim is a provenance error (the E-value must trace to the declared primary contrast).
+- For a non-null primary association presented as actionable with no residual-confounding quantification → MAJOR (request an E-value at the point estimate and the bound nearest the null); for a null primary, residual confounding is less load-bearing but power (see the power-aware null check) should be addressed instead.
+**Output template (O1 example)**:
+> "Table 1 shows that uric acid (p < 0.001), smoking pack-years (p = 0.001), HDL (p < 0.001), total cholesterol (p = 0.010), and HbA1c (p < 0.001) differ significantly across exposure groups, but the multivariable model adjusts only for age, sex, BMI, hypertension, and diabetes. Because these imbalanced laboratory covariates are plausible causes of the outcome, the reported association may carry residual confounding by measured variables. I'd suggest reporting an extended-adjustment sensitivity model that adds the imbalanced covariates and stating whether the primary estimate is materially unchanged; if the extended model attenuates the association, that should be reflected in the Abstract and Conclusions."
+**Output template (O5 example)**:
+> "The multivariable model appears to be complete-case, and pack-years is included as a continuous covariate. Because never-smokers carry a structural zero rather than a measured value, complete-case deletion can drop a large share of the unexposed stratum (here the analytic n falls from 5,203 to 1,993, with the female subgroup reduced to n ≈ 58), which distorts the subgroup estimates. I'd suggest adjusting for smoking status (never/former/current) rather than pack-years, reserving pack-years for an ever-smoker-restricted secondary analysis, and reporting the missingness fraction by exposure stratum with the MCAR/MAR/MNAR rationale."
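
The O1 check (imbalanced by exposure, yet absent from the adjustment set) can be sketched in a few lines of Python. This is an illustrative sketch, not the package's `scripts/check_confounding_completeness.py`, whose contents are not shown in this diff. The helper names, the table layout (mean and SD per exposure group), and every number below are hypothetical; only the SMD > 0.1 threshold comes from the probe text.

```python
import math


def smd(mean_exposed, sd_exposed, mean_unexposed, sd_unexposed):
    """Absolute standardized mean difference for a continuous covariate."""
    pooled_sd = math.sqrt((sd_exposed ** 2 + sd_unexposed ** 2) / 2)
    if pooled_sd == 0:
        # No spread in either group: identical means are balanced,
        # different means are maximally imbalanced.
        return 0.0 if mean_exposed == mean_unexposed else math.inf
    return abs(mean_exposed - mean_unexposed) / pooled_sd


def measured_but_unadjusted(table1, adjustment_set, threshold=0.1):
    """List covariates with SMD above the threshold that the model does not adjust for."""
    adjusted = {name.lower() for name in adjustment_set}
    flagged = []
    for name, stats in table1.items():
        value = smd(*stats)
        if value > threshold and name.lower() not in adjusted:
            flagged.append((name, round(value, 3)))
    return sorted(flagged)


# Hypothetical exposure-stratified Table 1:
# covariate -> (mean exposed, SD exposed, mean unexposed, SD unexposed)
HYPOTHETICAL_TABLE1 = {
    "age": (55.0, 10.0, 54.5, 10.0),
    "uric_acid": (6.0, 1.0, 5.5, 1.0),
    "hdl": (50.0, 10.0, 55.0, 10.0),
}
HYPOTHETICAL_ADJUSTMENT = {"age", "sex", "bmi"}

for covariate, value in measured_but_unadjusted(HYPOTHETICAL_TABLE1, HYPOTHETICAL_ADJUSTMENT):
    print(f"{covariate}: SMD = {value} (imbalanced, not adjusted)")
```

A flagged covariate is still only a lead. Per the probe, it becomes a MAJOR candidate when the variable is also a plausible cause of the outcome, and the suggested remedy is an extended-adjustment sensitivity model.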

package/skills/self-review/references/domain-probes/radiomics.md ADDED

@@ -0,0 +1,38 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a task- or
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# Radiomics / Feature-Reproducibility probes (R1–R4)
+A 4-probe checklist for radiomic feature reliability/reproducibility, acquisition–reconstruction parameter sweeps, and reliability/harmonization-based feature filtering claims. These probes complement (do not replace) the generic Phase 2 issue checklist. Their purpose is to keep design-level structural validity from being under-weighted: a review can correctly flag the reporting-layer issues (an over-claiming Abstract, a small external cohort) yet still miss whether the central contribution holds, which softens the assessment by one notch.
+**R1 — Design-grid circularity (in-domain "prediction" tautology)**:
+- Is an outcome (e.g., feature reliability) predicted from the very grid parameters that were systematically/exhaustively varied to construct the dataset?
+- If so, a high in-domain R² / accuracy is structurally guaranteed by the design ("predicting the construction recipe"), not a discovered relationship — do the predictors simply index the axes of the design grid?
+- Does the manuscript frame in-domain performance as a finding/success and lead the Abstract/Key Points with it?
+- If yes → do **not** endorse the in-domain success. Recommend reframing so the substantive finding is the cross-domain transportability (and, where present, its failure). MAJOR candidate.
+**R2 — Construct validity / proxy-target gap**:
+- The clinical rationale typically assumes that features which are reliable/stable/robust in the phantom are also better predictors of a biological/clinical target. A feature can be perfectly stable and biologically uninformative — this link is not logically guaranteed.
+- Is any post-filter performance gain shown to be signal recovery, rather than a by-product of removing a degraded/misaligned baseline feature space?
+- Does the manuscript acknowledge and test the orthogonality of the proxy (reliability) and the target (outcome)? Absent → MAJOR candidate.
+**R3 — Transportability framing vs reporting issue**:
+- When cross-phantom / cross-scanner / cross-center failure (negative R² on the target domain, low Jaccard overlap of selected features, calibration slope < 1) is the substantive result, is it nonetheless framed as a generalization success in the Abstract/Key Points/Conclusion?
+- Does the Results text state explicitly that a negative R² on the target domain means the model performs worse than predicting the mean (i.e., the mapping does not transport), rather than reading it as a weak continuous performance metric?
+- **Calibration link**: if in-domain "success" is partly a design artifact and the cross-domain result is a failure, reframing the Abstract will not rescue the central contribution. This is a design-level finding, not a reporting fix — keep its severity at the design level and do not soften it to a reporting issue.
+**R4 — Multiplicity (model × threshold / model × cohort grid)**:
+- Are multiple classifiers × multiple reliability thresholds (or cohorts) compared with one-sided tests, with a few reaching p < 0.05?
+- Is multiple-testing correction applied, and is the expected number of false positives by chance named explicitly (e.g., "5 models × 3 thresholds = 15 tests, ≈1 expected false positive")? Do not defer this to a generic "statistical review needed."
+- For a small external cohort (n ≤ ~30), do bootstrap ΔAUC intervals cross zero? If so, restrict any headline-gain claim accordingly (e.g., to a single classifier family in a small cohort).
+**Output template (R1 example)**:
+> "Because the acquisition parameters were varied as a systematic factorial grid, a model that predicts feature reliability from those same parameters is largely recovering the grid by construction; the in-domain R² ≈ 1.0 therefore reflects design structure rather than a discovered relationship. I'd suggest reframing the Abstract and Key Points so the substantive finding is the cross-phantom/cross-scanner transportability (and its failure), and stating explicitly in the Results that a negative R² on the target domain means the model performs worse than predicting the mean — i.e., the reliability mapping does not transport."
+**Output template (R4 example)**:
+> "The reported gains come from a grid of [N models] × [M thresholds] one-sided comparisons; with [N×M] tests, roughly one positive is expected by chance alone, and the external cohort (n = [k]) yields bootstrap ΔAUC intervals that cross zero for several thresholds. I'd suggest reporting a multiplicity-adjusted analysis (or stating the expected false-positive count), restricting the headline claim to the classifier family that survives, and marking the ΔAUC intervals that cross zero in the figure."
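
The R4 arithmetic ("5 models × 3 thresholds = 15 tests, ≈1 expected false positive") can be made explicit with a short Python sketch. It assumes every null hypothesis is true and each test is run at α = 0.05, and it uses the Holm step-down procedure as one possible multiplicity correction; the probe text does not name a specific method. Both function names and the example p-values are hypothetical.

```python
def expected_false_positives(n_models, n_thresholds, alpha=0.05):
    """Expected number of chance 'hits' if all nulls are true and each test uses alpha."""
    return n_models * n_thresholds * alpha


def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction: return one reject/keep flag per input p-value."""
    m = len(p_values)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject


# The grid named in the probe: 5 models x 3 thresholds = 15 tests.
print(expected_false_positives(5, 3))

# Invented p-values for three comparisons; two are below 0.05 before correction.
print(holm_reject([0.001, 0.04, 0.03]))
```

With 15 tests the expected count is about 0.75, which is the "roughly one positive by chance" the probe asks reviewers to state. In the invented three-test example only the smallest p-value survives correction.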

package/skills/self-review/references/domain-probes/sr_ma.md ADDED

@@ -0,0 +1,87 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a task- or
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# Systematic Review / Meta-Analysis probes (P0–P10)
+Internal-consistency-first gate (P0) plus a 10-probe checklist (P1–P10). These probes complement (do not replace) the generic Phase 2 issue checklist.
+**P0 — Internal-consistency-first gate (run before P1; gates any fabrication claim)**:
+- Before alleging fabrication on a manuscript that "feels AI-generated", reproduce the headline pooled statistics, paired study counts (k), and subgroup counts directly from the extracted data table (or supplement included-studies table).
+- If paired k, pooled medians, and subgroup counts reproduce, fabrication is unlikely — **pivot the review to table-vs-source fidelity (P1), comparator definition (P1), and eligibility**, not to a fabrication framing.
+- Only if the table cannot be reproduced, or is internally inconsistent, escalate to a transparency/integrity MAJOR.
+- Rationale: an "AI-smelling" surface is not evidence of fabrication. Real references can be present and the arithmetic coherent while the substantive flaws are extraction, comparator, eligibility, and overclaiming.
+**P1 — Performance-MA value + comparator-existence probe**:
+- For method-comparison MAs reporting accuracy / DSC / AUC / F1 (model-vs-model, AI-vs-reader, two training paradigms) and for DTA MAs reporting sensitivity / specificity, select ≥2 outlier or headline-driving studies.
+- (a) Verify each sampled arm value against the source paper (PubMed abstract or full text). For DTA cells, check for **sens/spec swap** (source sens=A% / spec=B% appearing in the forest as sens=B% / spec=A%).
+- (b) **Comparator-existence check**: verify the comparator arm is consistently defined and actually exists in each source. A baseline mislabeled as the comparator inflates the headline (e.g., a limited single-source baseline reported as a "centralised" comparator when the source paper has no centralised arm).
+- (c) Per-study schema: `Exists | Correct citation | Eligible (domain-specific) | Same comparator (same task/dataset) | Value matches source | Author-derived/averaged | Verdict`.
+- (d) Severity ladder: `<1pp rounding or author-derived average = minor`; `wrong dataset/task/comparator or not domain-specific = major`; `unfindable or wrong-citation = integrity concern (verify against source); potentially major`.
+- If a confirmed error drives a reported subgroup p-value or a headline claim, register as a primary major finding.
+**P2 — Cohort / benchmark non-independence probe**:
+- Identify clusters in included studies sharing: (a) institution name, (b) author surname + year proximity, (c) public ICU/EHR database (MIMIC-IV, eICU, MIMIC-III, KNHIS, UK Biobank, Optum, MarketScan, IBM), (d) **public imaging-challenge benchmark** (BraTS, FeTS, TCIA, Kaggle) reused across multiple included studies.
+- For each cluster, fetch PubMed efetch affiliation + abstract Methods database/benchmark source.
+- Flag pairs sharing the same data source + overlapping enrollment period (or the same public benchmark) as "high-confidence non-independence".
+- Manuscript should acknowledge in Limitations + perform a leave-one-dataset-out sensitivity analysis and add a data-provenance column to Table 1. If absent → MAJOR.
+- **Nuance**: map provenance and *request* the provenance column + sensitivity analysis; do NOT assert that a specific study used a given benchmark from coarse supplement labels alone (e.g., a supplement labeling a study only as "Hospital" or "Public" does not confirm BraTS/FeTS use). Confirm against the source before stating it.
+**P3 — Diagnostic subset N transparency (mixed DTA + prognostic MA)**:
+- Compute bivariate pool denominator (TP+FP+TN+FN) from Table 2 or forest plot.
+- Compare to total N reported in Abstract.
+- If diagnostic subset is <50% of total without explicit "diagnostic subset N = X / Y" in Results → MAJOR transparency gap.
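The P3 arithmetic can be illustrated with a small helper. This is a sketch under assumptions: the function name, argument names, and the counts in the example are placeholders, and whether the Results carry an explicit "diagnostic subset N = X / Y" statement is something the reviewer supplies by reading the manuscript.

```python
def diagnostic_subset_check(tp, fp, tn, fn, total_n,
                            explicit_statement_present, threshold=0.5):
    """Compare the pooled 2x2 denominator with the total N from the Abstract."""
    if total_n <= 0:
        raise ValueError("total_n must be positive")
    subset_n = tp + fp + tn + fn          # bivariate pool denominator
    share = subset_n / total_n            # fraction of the total N
    # Transparency gap: subset under the threshold and not stated explicitly.
    flag = share < threshold and not explicit_statement_present
    return {"subset_n": subset_n, "share": share, "major_transparency_gap": flag}


# Placeholder counts: 600 of 2,000 participants enter the diagnostic pool.
print(diagnostic_subset_check(120, 30, 400, 50, 2000,
                              explicit_statement_present=False))
```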
+**P4 — k=1 subgroup flag**:
+- Inspect subgroup analyses for strata with k=1 (single included study).
+- If a reported subgroup p-value is driven by a k=1 stratum → flag MAJOR.
+- Recommend reframing as exploratory or removing from formal subgroup test.
+**P5 — Supplementary completeness check**:
+- The SR-MA supplementary material must contain at minimum:
+  - PRISMA / PRISMA-DTA checklist with page refs
+  - Full-text exclusion list with reasons (per PRISMA 2020 item 16b)
+  - Per-study data extraction table
+  - Per-study × per-domain risk-of-bias table (QUADAS-2 / QUADAS-AI / PROBAST / PROBAST-AI)
+  - Full search strategy verbatim per database
+- If supplementary contains only figure captions or is missing 3+ of these → MAJOR.
+**P6 — PROSPERO ID format + live URL request**:
+- Standard PROSPERO format: `CRD4` + 4-digit YYYY + 6-digit sequential = 14 chars total (so 20xx registrations begin `CRD42`). Some pre-2020 IDs are 13 chars (5-digit sequential).
+- IDs with >14 chars or non-numeric tail → FORMAT_ANOMALY (MAJOR).
+- Always request authors provide live registration URL in cover letter for protocol cross-check.
+**P7 — Reference duplicate detection** (extends `/verify-refs`):
+- Run `/verify-refs` (PubMed + CrossRef). In addition to the standard checks, detect any PMID or DOI that appears more than once within the reference list.
+- Verbatim duplicates suggest a reference-compilation error (commonly LLM-assisted) → MAJOR (citation renumbering required).
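A minimal sketch of the duplicate check, assuming the reference list has already been parsed into records. The record shape (`number`, `pmid`, `doi`) and both helper names are hypothetical; this is not the `/verify-refs` implementation, and the identifiers below are synthetic.

```python
from collections import defaultdict

DOI_PREFIXES = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(doi):
    """Lower-case a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def find_duplicate_refs(refs):
    """Map each repeated PMID/DOI to the reference numbers that carry it."""
    seen = defaultdict(list)
    for ref in refs:
        if ref.get("pmid"):
            seen[("pmid", str(ref["pmid"]).strip())].append(ref["number"])
        if ref.get("doi"):
            seen[("doi", normalize_doi(ref["doi"]))].append(ref["number"])
    return {key: nums for key, nums in seen.items() if len(nums) > 1}


# Synthetic placeholder references.
refs = [
    {"number": 1, "pmid": "111", "doi": "10.0000/example.1"},
    {"number": 2, "pmid": "222", "doi": "https://doi.org/10.0000/EXAMPLE.1"},
    {"number": 3, "pmid": "111", "doi": None},
]
print(find_duplicate_refs(refs))
```

Normalizing the DOI before comparison matters because the same DOI is often written once as a bare string and once as a resolver URL.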
+**P8 — AI Disclosure presence**:
+- `grep -iE "chatgpt|gpt-|llm|generative ai|ai was used|ai-assisted|copilot|claude|gemini|chatbot|large language model"` on manuscript body.
+- If 0 matches AND journal requires AI Disclosure (RYAI / Radiology / RSNA family / Lancet family / JAMA family / most BMJ family / Nature family) → flag MINOR-to-MAJOR.
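Where `grep` is not available, the same P8 pattern can be applied with Python's `re` module. This is a sketch only: the function name is hypothetical, and the term list is copied from the probe rather than extended.

```python
import re

# Same alternation as the grep -iE pattern in P8, matched case-insensitively.
AI_TERMS = re.compile(
    r"chatgpt|gpt-|llm|generative ai|ai was used|ai-assisted|copilot|"
    r"claude|gemini|chatbot|large language model",
    re.IGNORECASE,
)


def ai_disclosure_hits(text):
    """Return (line_number, line) for every manuscript line matching a term."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if AI_TERMS.search(line)
    ]


manuscript = "Methods\nWe used ChatGPT (OpenAI) to improve readability.\n"
print(ai_disclosure_hits(manuscript))
```

A hit only shows that a term is present; whether it amounts to an adequate disclosure still needs to be read in context.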
+**P9 — Non-significant finding promoted to Abstract (overclaim probe)**:
+- Flag any exploratory or non-significant result (a crossover, a trend, a post-hoc subgroup) that appears in the Abstract or Key Points framed as a finding.
+- Sub-check: does the promoted finding depend on a study flagged or mis-extracted under P1? (A headline crossover can collapse once a mis-extracted comparator is corrected.)
+- Flag "non-inferiority" / "equivalence" asserted without a pre-specified margin. A margin cannot be pre-specified retrospectively — ask the authors to document any pre-existing protocol margin, otherwise drop the non-inferiority language or present it explicitly as a post hoc equivalence / sensitivity analysis.
+**P10 — Citation-metadata confusion class (over-escalation guard)**:
+- DOI-suffix digits that surface as an apparent article number (e.g., a DOI tail "77196" against article number 26068, or "60466-1" against 6274) are cosmetic metadata confusion, **not** fabrication — do not escalate them as fabricated references.
+- Reference-list duplicates are handled by `/verify-refs` (`duplicate_findings[]`); AI-disclosure presence is the cross-cutting P8 check. Neither is unique to SR/MA.
+**Output template (P1 cell-swap example)**:
+> "I spot-checked [Author Year] (PMID [...]) against the source paper and found that the values in Figure X are swapped. The source paper reports external-test sensitivity A% / specificity B% (n=N); the manuscript forest entries place [num1/denom1] in the sensitivity slot (which is the source's specificity numerator/denominator) and [num2/denom2] in the specificity slot (which is the source's sensitivity)."
+**Output template (P1 comparator-existence example)**:
+> "I spot-checked [Author Year] (PMID [...]) against the source. The manuscript lists this study's comparator ('[label]', [value]) in [comparison], but the source paper does not report that arm; the [value] appears to be the study's [limited single-source baseline]. Because this entry contributes to [the pooled comparison / a headline claim], I'd suggest re-extracting the comparator definition per study and adding a comparator-definition column to Table 1 so readers can confirm each arm is the same task on the same data."
+**Output template (P2 example)**:
+> "[Author1 Year1] uses [Database] (N=...). [Author2 Year2] uses [Database] (N=...). These are almost certainly overlapping patient pools, and the statistical independence assumption for MA pooling is violated. I'd suggest a sensitivity analysis excluding one of the two studies, plus an explicit cohort-source column in Table 1."
+**Discipline — leads vs findings (applies to every P0–P10 probe)**:
+- Output from a forensic sub-agent or automated scan is a **lead, never a finding, until confirmed against the source.** Concrete failure modes to discard on inspection: treating recent (in-press / current-year) publication dates as "impossible", inventing journal article-number rules, and inflated all-or-nothing fabrication-risk scores.
+- Before finalizing, run an **overclaim sweep of your own draft** (mandatory external-QC pass — independent model or colleague). Two worked examples: a strong claim that "the references are real, not fabricated" should be narrowed to "the sampled references / DOIs resolved"; a benchmark example list should be trimmed to studies whose benchmark use was source-confirmed.
+- **Do not compute chance-probabilities** for suspicious or identical values. Record the observation neutrally: "exact match to ≥2 decimals; source verification pending."

package/skills/self-review/references/domain-probes/survival_prognostic.md ADDED

@@ -0,0 +1,68 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a task- or
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# Survival / Prognostic Model probes (S1–S8)
+An 8-probe checklist for time-to-event outcomes and prognostic model development. These probes complement (do not replace) the generic Phase 2 issue checklist and may be co-applied with the SR-MA probes for a meta-analysis of prognostic models.
+**S1 — Conditioning / causal framing**:
+- Does the manuscript claim a "preoperative" / "screening" / "triage" / "X replaces Y" use case while outcomes are conditioned on the downstream treatment whose value the model is supposed to inform?
+- Do the inputs include post-decision variables (resection margin status, adjuvant chemo/radiotherapy, transplant status) that are unknown at the claimed decision point?
+- Non-treatment comparator or causal framework present?
+- Conditioning gap → MAJOR candidate. Recommend retrain without leaky variables / add non-treatment arm / reframe intended use.
+- **Time origin & survivorship** (incident / transition models): is the at-risk clock started at the correct origin for each incident model, with immortal time (a span in which the event cannot occur, misattributed to one group) and left-truncation / delayed entry handled? Is a "progressor" / transition label conditioned on *surviving to* a later ascertainment (a second scan, a follow-up visit) — a survivorship that needs a landmark time or an explicit intermediate-state model? If the primary analysis is **not** the full cohort (e.g., complete-case while a large fraction is missing) and the complete-case model is the significant one, that selection needs a stated justification and a MAR rationale — an outcome-dependent choice of the analysis set is the S8 concern. Any of these unhandled → MAJOR.
+- **Self-confession escalation**: a Methods or Limitations admission that a time-origin, immortal-time, return-conditioning, or selection issue was *"not formally assessed"* (or equivalent) is itself a MAJOR — it names a known bias that was left unaddressed, not a mitigated limitation.
+**S2 — Censoring handling in training loss**:
+- Cox partial-likelihood loss or DeepSurv-style loss specified? How is censoring handled (right-censoring, interval-censoring, informative censoring by death)?
+- If Methods describe a Cox or partial-likelihood loss but do not specify censoring treatment, register as MAJOR (reproducibility).
+- Covariate incompleteness in the model fit is part of the same disclosure: is a structural-zero covariate (a never-smoker's pack-years = 0 by definition, not missing) handled as a zero, or dropped under complete-case so the unexposed stratum and the events-per-variable silently collapse? An undisclosed complete-case collapse from a dose/duration covariate is a reproducibility + power issue — adjust on the categorical status and reserve the continuous dose for an exposed-only secondary analysis.
+**S3 — Competing risks**:
+- 2+ event types (local recurrence + distant metastasis + death, or cause-specific mortality) modeled?
+- Cause-specific hazards or Fine-Gray subdistribution hazards used?
+- Is a patient who develops one event still treated as at risk for the others, and is death handled as a competing event rather than as non-informative censoring?
+- If competing-risks structure is ignored and outcomes are treated as independent right-censored events → MAJOR.
+**S4 — Cutoff derivation optimism**:
+- Cutoffs derived via maximally selected log-rank statistics, AUC-based Youden's J, or similar data-driven methods?
+- Hothorn-Lausen correction or equivalent optimism correction applied?
+- Was the same cohort used for both model selection (hyperparameter tuning) AND cutoff selection? (Optimism bias)
+- Bootstrap optimism estimate or sensitivity analysis on cutoff choice (e.g., ±0.5 SD perturbation)?
+- Same-cohort dual use without correction → MAJOR.
+**S5 — Comparator horizon alignment**:
+- External baseline prognostic nomogram (commonly designed for 5- or 10-year endpoints) applied as the comparator?
+- Manuscript's available follow-up duration aligned with that horizon?
+- Mismatch → baseline C-index degradation may reflect a design-horizon mismatch rather than intrinsic inferiority. Recommend time-dependent C-index or time-stratified analyses.
+- Baseline implementation specified: applied as published, locally recalibrated, or refit as a new Cox model with similar variables?
+- Unclear implementation → MAJOR (a refit local model should be described as a clinicopathologic comparator, not a "guideline model").
+**S6 — C-index variant + reverse Kaplan-Meier follow-up**:
+- Which C-index variant: Harrell's C, Uno's C, time-dependent AUC, IPCW-C?
+- Variant appropriate for the censoring distribution and sample size?
+- Time-dependent AUC at a clinically anchored horizon (e.g., 2-year, 3-year) reported alongside Harrell's C?
+- Reverse Kaplan-Meier median follow-up reported per cohort and per outcome (local recurrence vs distant metastasis separately) with censoring date?
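For readers unfamiliar with the reverse Kaplan-Meier estimate asked for in S6, a dependency-free sketch follows. It flips the event indicator so that censoring becomes the "event", then reads off the first time the resulting curve reaches 0.5 or lower. The function name, the tie handling, and the median convention are assumptions; statistical packages may differ on both conventions.

```python
def reverse_km_median_followup(times, events):
    """Reverse Kaplan-Meier median follow-up.

    times  : follow-up time per subject
    events : 1 = outcome event observed, 0 = censored
    Returns the first time the reverse-KM curve is <= 0.5, or None if the
    curve never reaches 0.5 (for example, when nobody is censored).
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    # Flip the indicator: a censored subject counts as a reverse 'event'.
    data = sorted(zip(times, (1 - e for e in events)))
    n_at_risk = len(data)
    surv = 1.0
    i = 0
    while i < len(data):
        t = data[i][0]
        leaving = 0      # everyone whose follow-up ends at time t
        censorings = 0   # reverse 'events' at time t
        while i < len(data) and data[i][0] == t:
            censorings += data[i][1]
            leaving += 1
            i += 1
        if censorings:
            surv *= 1 - censorings / n_at_risk
        if surv <= 0.5:
            return t
        n_at_risk -= leaving
    return None


# Placeholder data: five subjects, all censored at distinct times.
print(reverse_km_median_followup([1, 2, 3, 4, 5], [0, 0, 0, 0, 0]))
```

Running it separately per cohort and per outcome gives the per-outcome medians the probe asks for.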
+**S7 — Calibration beyond discrimination**:
+- Calibration plot (intercept / slope) across all cohorts?
+- Brier score / Integrated Brier Score (IBS)?
+- Decision-curve analysis at clinically relevant probability thresholds?
+- For a prognostic model intended to guide surveillance intensity, treatment intensification, or eligibility for adjuvant therapy, discrimination alone is insufficient. If Methods mention calibration but Results/supplement contain no calibration plot or numeric metrics → MAJOR.
+**S8 — Estimand provenance**:
+- Is the survival estimand stated explicitly and held consistent across Abstract / Methods / Results — event-free survival, cause-specific cumulative incidence, all-cause mortality — and at the subject vs population level? A subdistribution hazard (Fine-Gray) answers a different question than a cause-specific hazard; quoting an sHR for an etiologic claim, or a cause-specific HR for an absolute-risk claim, is an estimand mismatch.
+- Is the evaluation horizon (2-/3-/5-year) and the primary model fixed in advance and consistent with the registered/pre-specified primary endpoint, or was the primary endpoint, model, or horizon re-designated after the results were known (outcome-dependent primary selection)?
+- Does every derived statistic (E-value, an sHR-vs-cause-specific-HR contrast) trace to the *declared primary* estimand, or is a supporting/non-primary estimate quoted as if it bounded the headline claim?
+- Estimand drift — a primary re-designated post-hoc, or a derived statistic computed on a non-primary estimate but presented as primary → MAJOR. Recommend reporting the pre-specified and revised models coequally, disclosing the change, and recomputing any E-value for the primary estimate. (The self-review skill automates the registration ↔ manuscript and E-value arithmetic checks as Phase 2.5f, `scripts/check_claim_artifact.py`.)
+**Output template (S4 example)**:
+> "The Methods (p. X) state that optimal cutoffs for [outcome] were determined via maximally selected log-rank statistics on the internal validation cohort. Two concerns: (a) Hothorn-Lausen correction is cited but it is unclear whether the corrected p-value was used in the cutoff selection; (b) the internal validation cohort appears to have been used for both model selection and cutoff selection, which is a known source of optimism. I'd suggest reporting bootstrap-based optimism estimates or a sensitivity analysis showing how external performance shifts under ±0.5-SD perturbation of the chosen cutoff."
+**Output template (S5 example)**:
+> "The chosen baseline nomogram was originally designed and validated for prediction of long-horizon endpoints (5- and 10-year). In this study, median follow-up in [external cohort] is substantially shorter than that horizon, so the comparator's apparent underperformance may partly reflect a horizon mismatch rather than intrinsic inferiority. I'd suggest (a) stating explicitly the time horizon at which both models were evaluated, (b) reporting time-dependent C-indices at a clinically anchored horizon, and (c) clarifying whether the comparator was applied as published, recalibrated locally, or refit as a new Cox model with similar variables."

package/skills/self-review/references/exemplar_findings/README.md ADDED

@@ -0,0 +1,43 @@
+# Exemplar Anticipated Comments — gate result → self-review finding
+`/self-review` runs deterministic Phase 2.5 gates (cohort arithmetic, confounding
+completeness, scope coherence, claim-vs-artifact) and a systematic A–K check, then writes
+**Anticipated Major / Minor Comments** the author can fix before a reviewer sees them. This
+directory models how a *gate hit* becomes a well-formed Anticipated Comment — the missing
+worked-example layer between "the gate fired" and "here is the comment + fix."
+Each file shows the same shape `/self-review` Phase 3 produces:
+1. **What fired** — the deterministic gate (or category) and the specific signal.
+2. **Anticipated Major/Minor Comment** — phrased the way the *reviewer* will phrase it, so
+   the author reads it as the warning it is.
+3. **Severity** — Fatal (conclusion-threatening / design-level → Anticipated **Major**) vs
+   Fixable (reporting-level → Anticipated **Minor**).
+4. **Category** — the closest letter (A–K).
+5. **Fix** — the concrete change to make now, and whether it is `fixable_by_ai`.
+6. **R0-ready** — a one-line form suitable for Phase 3b numbering into the `/revise`
+   pipeline.
+These differ from `/peer-review`'s `exemplar_reviews/`: those model a *reviewer's*
+partner-voice comment to authors; these model the *author's* anticipation-and-fix entry.
+## Contents
+- `cohort_arithmetic_mismatch.md` — STROBE cascade / rate back-calc fails (gate:
+  `check_cohort_arithmetic.py`) → category A.
+- `unadjusted_confounder.md` — an imbalanced measured covariate left out of the model
+  (gate: `check_confounding_completeness.py`) → category C/E.
+- `scope_overreach_cross_sectional.md` — a prognostic/surveillance claim from a
+  cross-sectional design (gate: `check_scope_coherence.py`) → category D.
+- `estimand_drift_posthoc_primary.md` — the reported "primary" differs from the registered
+  one (gate: `check_claim_artifact.py`) → category C.
+## Curator guidelines
+- **Synthetic only.** Author the example with placeholder numbers; never paste a real
+  manuscript's text or data. No PII, no real citations, English only.
+- **One gate/finding per file**, end to end (fired → comment → severity → category → fix →
+  R0 line).
+- **Tie severity to the same Fatal/Fixable rule** the skill uses, and name the gate/script
+  that surfaces it (do not re-document the gate — link by name).
+- Keep each file ~40–70 lines.

package/skills/self-review/references/exemplar_findings/cohort_arithmetic_mismatch.md ADDED

@@ -0,0 +1,35 @@
+# Exemplar — cohort arithmetic does not reconcile
+**Fired:** `check_cohort_arithmetic.py` — `CASCADE_SUM` (the STROBE flow does not balance:
+start N − Σ(exclusions) ≠ final analytic N) and/or `RATE_BACKCALC` (a reported incidence
+rate does not reproduce from numerator/person-time).
+**Severity:** Fatal → Anticipated **Major**. **Category:** A. Study Design & Data Integrity.
+## Anticipated Major Comment (how a reviewer will put it)
+> The participant flow does not add up. The Methods report 1,000 screened and exclusions of
+> 120 + 60 + 40, which leaves 780, but the analytic cohort is given as 800 (Figure 1 vs
+> Results, first paragraph). Please reconcile the flow diagram, the text, and Table 1 so the
+> numbers are internally consistent, and state which figure is correct.
+>
+> Relatedly, the incidence rate of 12.0 per 1,000 person-years with 9,800 person-years
+> implies ~118 events, but 96 events are reported (Table 2). Please confirm the numerator,
+> denominator, and rate.
+## Severity / category rationale
+This is **Fatal** because the cohort size and the event/person-time are the denominators of
+every downstream estimate — if they are inconsistent, the reader cannot trust any rate or
+effect size. It is **category A** (data integrity), not a wording issue.
+## Fix
+Recompute the cascade and the rate from the source data (never hand-retype), correct the
+diagram/text/table to a single set of numbers, and add a one-line reconciliation note if a
+late exclusion was applied. `fixable_by_ai: false` — the numbers must come from the data,
+not be guessed; the author re-derives from the CSV.
+## R0-ready line
+> R0-A1 (Major, Fatal): STROBE flow and the incidence rate do not reconcile (Figure 1 /
+> Results / Table 2); re-derive from source and unify. [gate: check_cohort_arithmetic]
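The two signals in this exemplar reduce to simple arithmetic, sketched below with the exemplar's own placeholder numbers. This is not the code of `check_cohort_arithmetic.py`; the function names and the rate tolerance are assumptions made for illustration.

```python
def cascade_balances(start_n, exclusions, final_n):
    """CASCADE_SUM-style check: start N minus exclusions should equal final N."""
    expected = start_n - sum(exclusions)
    return expected == final_n, expected


def implied_events(rate_per_1000_py, person_years):
    """Events implied by a reported rate and the reported person-time."""
    return rate_per_1000_py * person_years / 1000


def rate_reproduces(events, person_years, reported_rate_per_1000, tol=0.05):
    """RATE_BACKCALC-style check: does events / person-time match the rate?"""
    recomputed = events / person_years * 1000
    return abs(recomputed - reported_rate_per_1000) <= tol, recomputed


# Exemplar numbers: 1,000 screened, exclusions 120 + 60 + 40, analytic N 800.
print(cascade_balances(1000, [120, 60, 40], 800))
# Exemplar numbers: 12.0 per 1,000 person-years over 9,800 person-years.
print(round(implied_events(12.0, 9800), 1))
# 96 reported events over the same person-time.
print(rate_reproduces(96, 9800, 12.0))
```

The flow yields 780 rather than 800, and 96 events over 9,800 person-years gives roughly 9.8 per 1,000 person-years rather than 12.0, which is the mismatch the comment describes.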

package/skills/self-review/references/exemplar_findings/estimand_drift_posthoc_primary.md ADDED

@@ -0,0 +1,39 @@
+# Exemplar — the reported "primary" differs from the registered one
+**Fired:** `check_claim_artifact.py` — `PRIMARY_REASSIGNED` / `ESTIMAND_DRIFT` (the
+manuscript's stated primary analysis or estimand does not match the pre-registration /
+protocol), or an E-value attached to a non-primary or non-reproducing estimate.
+**Severity:** Fatal → Anticipated **Major**. **Category:** C. Validation & Statistical
+Reporting.
+## Anticipated Major Comment (how a reviewer will put it)
+> The registered protocol names the complete-case model as the primary analysis, but the
+> manuscript presents the multiple-imputation model as primary (Methods, *Primary analysis*
+> vs the registration). Selecting the primary after seeing results is outcome-dependent and
+> can bias inference. Please either (a) restore the pre-registered primary and present the
+> other model as a pre-specified sensitivity analysis, or (b) if the change was unavoidable,
+> report both models coequally, disclose the change in the Abstract and a Limitations
+> paragraph, and lodge the corresponding registration amendment.
+>
+> The reported E-value of 2.79 should also be recomputed from, and attached to, the primary
+> estimate (it currently appears to derive from a different model).
+## Severity / category rationale
+**Fatal** because *which* result is primary determines the paper's headline; choosing it
+post hoc is a credibility issue a methods reviewer catches deterministically. **Category C**
+(estimand provenance). This is the estimand-provenance-lock principle: primary contrast and
+derived statistics trace to the pre-registration, not to the data.
+## Fix
+Restore the registered primary (or present both coequally with disclosure + amendment),
+recompute the E-value from the declared primary estimate, and propagate the framing to every
+claim site (Abstract, Highlights, any plain-language summary). `fixable_by_ai: false` —
+requires the registered protocol as the source of truth and a re-run.
+## R0-ready line
+> R0-C1 (Major, Fatal): reported primary ≠ registered primary; restore or report coequally +
+> disclose + amend; recompute E-value from the primary. [gate: check_claim_artifact]
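Recomputing an E-value is a one-line calculation once the primary estimate is fixed. The sketch below uses the standard risk-ratio formula, E = RR + sqrt(RR × (RR − 1)), with estimates below 1 inverted first. The risk ratio of 1.7 is an illustrative assumption chosen because it reproduces the exemplar's 2.79; the exemplar does not state the underlying estimate, and hazard or odds ratios for common outcomes need a conversion that is not shown here.

```python
import math


def e_value(rr):
    """E-value for a risk ratio point estimate."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1 / rr  # protective estimates are inverted before the formula
    return rr + math.sqrt(rr * (rr - 1))


# Illustrative assumption: a primary risk ratio of 1.7.
print(round(e_value(1.7), 2))
```

If the declared primary model gives a different estimate, the E-value changes with it, which is why the fix asks for it to be recomputed rather than carried over.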

package/skills/self-review/references/exemplar_findings/scope_overreach_cross_sectional.md ADDED

@@ -0,0 +1,35 @@
+# Exemplar — a prognostic/surveillance claim from a cross-sectional design
+**Fired:** `check_scope_coherence.py` — `CROSS_SECTIONAL_PROGNOSTIC` (a single-timepoint /
+cross-sectional design signal co-occurs with a conclusion-region action verb such as
+"predict", "progression", "rescreen", "surveillance").
+**Severity:** Fatal → Anticipated **Major**. **Category:** D. Clinical Framing & Importance.
+## Anticipated Major Comment (how a reviewer will put it)
+> The study is cross-sectional — exposure and outcome are measured at a single visit
+> (Methods, *Study design*) — but the Conclusion recommends a "surveillance interval" and
+> describes "disease progression." A cross-sectional association cannot establish temporal
+> order or progression, so these claims outrun the design. Please reframe the conclusion to
+> the association actually estimated (e.g., "X was associated with Y at the index visit") and
+> move any prognostic or surveillance language to a clearly-labeled hypothesis for future
+> longitudinal work.
+## Severity / category rationale
+This is **Fatal** because the gap is between the *design* and the *clinical action the paper
+recommends* — a reader could adopt a surveillance schedule that the data do not support. The
+fix is reframing, not new analysis, but the claim cannot stand as written. **Category D**
+(framing / endpoint↔conclusion scope).
+## Fix
+Rewrite the Abstract Conclusion and the Discussion's closing claim to the cross-sectional
+estimand; demote prognostic/surveillance statements to "future directions." Check that the
+Title does not imply prediction. `fixable_by_ai: true` (wording), but verify the author
+agrees the longitudinal claim is genuinely unsupported before softening.
+## R0-ready line
+> R0-D1 (Major, Fatal): cross-sectional design but the Conclusion claims surveillance/
+> progression; reframe to the index-visit association. [gate: check_scope_coherence]
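The co-occurrence signal described under "Fired" can be approximated with two regular expressions. This is a rough sketch, not the logic of `check_scope_coherence.py`: the design patterns are assumptions, the action verbs are the four named above, and the function name is hypothetical.

```python
import re

# Assumed phrasings of a single-timepoint design.
DESIGN = re.compile(r"cross[- ]sectional|single[- ]time[- ]?point|single visit",
                    re.IGNORECASE)
# Action verbs listed in the gate description.
ACTION = re.compile(r"predict|progression|rescreen|surveillance", re.IGNORECASE)


def cross_sectional_prognostic(methods_text, conclusion_text):
    """Return (fired, matched_terms) for a design/conclusion mismatch."""
    is_cross_sectional = bool(DESIGN.search(methods_text))
    terms = sorted({m.group(0).lower() for m in ACTION.finditer(conclusion_text)})
    return is_cross_sectional and bool(terms), terms


methods = "Study design: cross-sectional; exposure and outcome measured at one visit."
conclusion = "We recommend a 12-month surveillance interval to monitor disease progression."
print(cross_sectional_prognostic(methods, conclusion))
```

A keyword match is only a prompt to read the Conclusion; a sentence that mentions surveillance as future work would match too.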

package/skills/self-review/references/exemplar_findings/unadjusted_confounder.md ADDED

@@ -0,0 +1,36 @@
+# Exemplar — an imbalanced measured covariate is left out of the model
+**Fired:** `check_confounding_completeness.py` — `UNADJUSTED_IMBALANCED` (a covariate that
+is measured and imbalanced across exposure groups in Table 1 is not in the Methods
+adjustment set).
+**Severity:** usually Fixable → Anticipated **Minor**; Fatal → Anticipated **Major** when
+the covariate plausibly explains the primary association. **Category:** C. Validation &
+Statistical Reporting (with E. Reproducibility for the model spec).
+## Anticipated Major/Minor Comment (how a reviewer will put it)
+> Baseline smoking differs between the exposed and unexposed groups (Table 1: 41% vs 23%,
+> standardized mean difference 0.39), and smoking is a plausible confounder of the
+> exposure–outcome relationship, but it does not appear in the adjustment set (Methods,
+> *Statistical analysis*). Please either add it to the multivariable model or explain why it
+> was excluded, and report whether the primary estimate changes when it is included.
+## Severity / category rationale
+It is **Minor/Fixable** when the covariate is largely captured by covariates already in the
+adjustment set and the result is robust; it escalates to **Major/Fatal** when it is
+imbalanced *and* a plausible common cause of exposure and outcome *and* the primary estimate
+is near the significance boundary, because an unadjusted confounder could then flip the
+conclusion. Tag **category C** (the estimate's validity), noting the model spec under E.
+## Fix
+Add the covariate to the model (or pre-specified sensitivity model), report the adjusted
+estimate with its CI, and state the change from the unadjusted estimate. If exclusion was
+deliberate (e.g., collider/mediator), say so and cite the DAG rationale. `fixable_by_ai:
+false` — requires re-running the model on the data.
+## R0-ready line
+> R0-C2 (Major, Fatal-if-flips): smoking is imbalanced (SMD 0.39) but unadjusted; add to the
+> model and report the change in the primary estimate. [gate: check_confounding_completeness]
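The standardized mean difference quoted in the exemplar can be reproduced from the two proportions alone. The sketch below uses the usual formula for a binary covariate, (p1 − p2) / sqrt((p1(1 − p1) + p2(1 − p2)) / 2); the function name is hypothetical, and the 0.1 imbalance cutoff mentioned in the comment is a common convention assumed here, not a documented setting of `check_confounding_completeness.py`.

```python
import math


def smd_binary(p1, p2):
    """Standardized mean difference between two group proportions."""
    pooled_var = (p1 * (1 - p1) + p2 * (1 - p2)) / 2
    if pooled_var == 0:
        raise ValueError("both proportions are 0 or 1; SMD is undefined")
    return (p1 - p2) / math.sqrt(pooled_var)


# Exemplar numbers: baseline smoking 41% vs 23%.
smd = smd_binary(0.41, 0.23)
print(round(smd, 2))
# Assumed convention: an absolute SMD above 0.1 is treated as imbalance.
print(abs(smd) > 0.1)
```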