medsci-skills 4.1.0 (npm): version diff

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (702)

package/skills/peer-review/references/domain-probes/observational_confounding.md ADDED

@@ -0,0 +1,48 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a confounding /
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# Observational / Confounding probes (O1–O6)
+A 6-probe checklist for observational studies (cohort, case-control, cross-sectional, health-screening registry) where the central claim is an exposure–outcome association estimated by adjustment rather than randomization. These probes complement (do not replace) the generic Phase 2 issue checklist and the STROBE reporting items; they target the gap between what a manuscript *says* it adjusted for and what the exposure-stratified data show. O1 is data-checkable and the highest-yield probe — the self-review skill automates it as a deterministic gate (Phase 2.5e, `scripts/check_confounding_completeness.py`) that reads the exposure-stratified Table 1 and the Methods adjustment set.
+**O1 — Confounding completeness (measured-but-unadjusted)**:
+- Does the exposure-stratified baseline table (Table 1 by exposure) show covariates that are **significantly imbalanced** across exposure groups (p < 0.05, or an absolute standardized mean difference > 0.1) yet are **absent from the adjustment set**?
+- A covariate that was measured, is imbalanced by exposure, and is a plausible cause of the outcome is residual confounding by a *measured* variable — the most preventable kind. Common offenders in metabolic / screening cohorts: smoking pack-years, uric acid, HDL, total cholesterol, HbA1c, eGFR.
+- Is the adjustment set justified (DAG, prior literature, or a pre-specified plan), or is it a short default (age/sex/BMI + a few comorbidities) that silently omits imbalanced labs?
+- Measured-but-unadjusted imbalanced covariate(s) → MAJOR. Recommend an extended-adjustment sensitivity model that adds the omitted covariates and reports whether the primary estimate is robust; the original model stays primary only if the extended model agrees.
+**O2 — Adjustment-set provenance (DAG vs Table-1-stepwise)**:
+- Was the adjustment set chosen by a causal structure (DAG / explicit confounder reasoning) or by a data-driven "include if Table 1 p < 0.05" / stepwise rule?
+- Data-driven selection risks both directions: **over-adjustment** for mediators or colliders (a variable on the causal path, or a common effect of exposure and outcome, biases the estimate) and **under-adjustment** for a confounder that happens to be balanced in this sample.
+- No stated rationale for inclusion/exclusion of each adjustment variable → MAJOR (the same model can be confounded and over-adjusted at once).
+**O3 — Selection / collider bias at enrollment**:
+- Is the cohort a self-selected or conditioned sample (health-screening attendees, survivors, a registry conditioned on having had the index test) such that enrollment is a collider opening a backdoor path?
+- Index-event bias (conditioning on a first event), immortal-time bias (exposure defined over a window during which subjects must survive), and prevalent-user bias addressed?
+- Unaddressed selection/collider structure that could generate the reported association → MAJOR; at minimum require an explicit selection-bias paragraph and, where possible, a sensitivity analysis.
+**O4 — Exposure measurement validity**:
+- Is the exposure a validated/quantitative measure or an unvalidated binary flag (e.g., a single reader's visual call, an ICD code, a self-report) with no in-cohort reliability (κ / ICC) and no severity gradient?
+- Structural-zero dose covariates: a dose/duration variable anchored to a categorical exposure (never-smoker → pack-years = 0, never-drinker → grams = 0) must be treated as a structural zero, not missing — misclassification here both mismeasures the exposure and (O5) collapses the analytic sample.
+- Non-differential misclassification biases toward the null (an underpowered null is not reassurance); differential misclassification can bias either way. Binary/unvalidated exposure with no reliability estimate → MAJOR (or a prominent limitation with a quantitative bias argument).
+**O5 — Missing-data mechanism & complete-case collapse**:
+- Is the missing-data mechanism (MCAR / MAR / MNAR) stated and justified, with the missingness fraction per key variable reported (ideally by exposure stratum)?
+- Does a dose/duration covariate (pack-years, cessation duration, alcohol grams) entering a complete-case multivariable model collapse n in the unexposed stratum (structural zeros dropped as missing), distorting subgroup estimates? Report n before and after model fitting.
+- If multiple imputation is used, are the mechanism assumption, the number of imputations, the imputation model, and a seed reported, and are structural zeros kept out of the imputation? Unjustified MAR for a large missing fraction, or an undisclosed complete-case collapse → MAJOR.
+**O6 — Residual confounding quantification (E-value)**:
+- Is an E-value (or a comparable quantitative-bias / negative-control analysis) reported for the **primary** estimate and its confidence limit, so a reader can judge how strong an unmeasured confounder would need to be to explain the association?
+- An E-value computed for a non-primary, supporting estimate but quoted as if it bounds the primary claim is a provenance error (the E-value must trace to the declared primary contrast).
+- A non-null primary association presented as actionable with no residual-confounding quantification → MAJOR (request an E-value at the point estimate and at the confidence limit nearest the null); for a null primary, residual confounding is less load-bearing, but power (see the power-aware null check) should be addressed instead.
+**Output template (O1 example)**:
+> "Table 1 shows that uric acid (p < 0.001), smoking pack-years (p = 0.001), HDL (p < 0.001), total cholesterol (p = 0.010), and HbA1c (p < 0.001) differ significantly across exposure groups, but the multivariable model adjusts only for age, sex, BMI, hypertension, and diabetes. Because these imbalanced laboratory covariates are plausible causes of the outcome, the reported association may carry residual confounding by measured variables. I'd suggest reporting an extended-adjustment sensitivity model that adds the imbalanced covariates and stating whether the primary estimate is materially unchanged; if the extended model attenuates the association, that should be reflected in the Abstract and Conclusions."
+**Output template (O5 example)**:
+> "The multivariable model appears to be complete-case, and pack-years is included as a continuous covariate. Because never-smokers carry a structural zero rather than a measured value, complete-case deletion can drop a large share of the unexposed stratum (here the analytic n falls from 5,203 to 1,993, with the female subgroup reduced to n ≈ 58), which distorts the subgroup estimates. I'd suggest adjusting for smoking status (never/former/current) rather than pack-years, reserving pack-years for an ever-smoker-restricted secondary analysis, and reporting the missingness fraction by exposure stratum with the MCAR/MAR/MNAR rationale."
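
Editorial sketch (not part of the package): the O1 and O6 probes above rest on two small computations, a set comparison between imbalanced Table 1 covariates and the adjustment set, and the E-value arithmetic. The Python below illustrates both under stated assumptions. It is not the package's `scripts/check_confounding_completeness.py`, whose internals this diff does not show. The function names, the input shapes, and the example p-values are hypothetical. The E-value uses the standard closed form for a risk ratio, RR + sqrt(RR × (RR − 1)); odds ratios or hazard ratios for common outcomes would need conversion to the risk-ratio scale first.

```python
import math


def unadjusted_imbalanced(table1_p, adjustment_set, alpha=0.05):
    """Return Table 1 covariates with p < alpha that are absent from the adjustment set.

    table1_p: dict mapping covariate name -> p-value across exposure groups (hypothetical shape).
    adjustment_set: iterable of covariate names listed in Methods.
    """
    adjusted = {name.lower() for name in adjustment_set}
    return sorted(
        name for name, p in table1_p.items()
        if p < alpha and name.lower() not in adjusted
    )


def e_value(rr):
    """E-value for a risk ratio: RR + sqrt(RR * (RR - 1)), after inverting RR < 1."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1 / rr  # protective estimates are inverted before applying the formula
    return rr + math.sqrt(rr * (rr - 1))


def e_value_for_bound(rr, ci_low, ci_high):
    """E-value for the confidence limit nearest the null; 1.0 if the interval includes 1."""
    if ci_low <= 1 <= ci_high:
        return 1.0
    return e_value(ci_low if rr > 1 else ci_high)


if __name__ == "__main__":
    # Illustrative stand-in values only; "p < 0.001" is encoded as 0.0005.
    table1 = {"uric acid": 0.0005, "smoking pack-years": 0.001, "HDL": 0.0005,
              "total cholesterol": 0.010, "HbA1c": 0.0005, "age": 0.03}
    methods = ["age", "sex", "BMI", "hypertension", "diabetes"]
    print(unadjusted_imbalanced(table1, methods))
    print(round(e_value(2.0), 3), round(e_value_for_bound(2.0, 1.5, 2.7), 3))
```

A name match is only a lead: a flagged covariate still has to be judged as a plausible cause of the outcome before it is reported as a MAJOR finding.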

package/skills/peer-review/references/domain-probes/radiomics.md ADDED

@@ -0,0 +1,38 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a task- or
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# Radiomics / Feature-Reproducibility probes (R1–R4)
+A 4-probe checklist for radiomic feature reliability/reproducibility, acquisition–reconstruction parameter sweeps, and reliability/harmonization-based feature filtering claims. These probes complement (do not replace) the generic Phase 2 issue checklist. Their purpose is to keep design-level structural validity from being under-weighted: a review can correctly flag the reporting-layer issues (an over-claiming Abstract, a small external cohort) yet still miss whether the central contribution holds, which softens the assessment by one notch.
+**R1 — Design-grid circularity (in-domain "prediction" tautology)**:
+- Is an outcome (e.g., feature reliability) predicted from the very grid parameters that were systematically/exhaustively varied to construct the dataset?
+- If so, a high in-domain R² / accuracy is structurally guaranteed by the design ("predicting the construction recipe"), not a discovered relationship — do the predictors simply index the axes of the design grid?
+- Does the manuscript frame in-domain performance as a finding/success and lead the Abstract/Key Points with it?
+- If yes → do **not** endorse the in-domain success. Recommend reframing so the substantive finding is the cross-domain transportability (and, where present, its failure). MAJOR candidate.
+**R2 — Construct validity / proxy-target gap**:
+- The clinical rationale typically assumes that features which are reliable/stable/robust in the phantom are also better predictors of a biological/clinical target. A feature can be perfectly stable and biologically uninformative — this link is not logically guaranteed.
+- Is any post-filter performance gain shown to be signal recovery, rather than a by-product of removing a degraded/misaligned baseline feature space?
+- Does the manuscript acknowledge and test the orthogonality of the proxy (reliability) and the target (outcome)? Absent → MAJOR candidate.
+**R3 — Transportability framing vs reporting issue**:
+- When cross-phantom / cross-scanner / cross-center failure (negative R² on the target domain, low Jaccard overlap of selected features, calibration slope < 1) is the substantive result, is it nonetheless framed as a generalization success in the Abstract/Key Points/Conclusion?
+- Does the Results text state explicitly that a negative R² on the target domain means the model performs worse than predicting the mean (i.e., the mapping does not transport), rather than reading it as a weak continuous performance metric?
+- **Calibration link**: if in-domain "success" is partly a design artifact and the cross-domain result is a failure, reframing the Abstract will not rescue the central contribution. This is a design-level finding, not a reporting fix — keep its severity at the design level and do not soften it to a reporting issue.
+**R4 — Multiplicity (model × threshold / model × cohort grid)**:
+- Are multiple classifiers × multiple reliability thresholds (or cohorts) compared with one-sided tests, with a few reaching p < 0.05?
+- Is multiple-testing correction applied, and is the expected number of false positives by chance named explicitly (e.g., "5 models × 3 thresholds = 15 tests, ≈1 expected false positive")? Do not defer this to a generic "statistical review needed."
+- For a small external cohort (n ≤ ~30), do bootstrap ΔAUC intervals cross zero? If so, restrict any headline-gain claim accordingly (e.g., to a single classifier family in a small cohort).
+**Output template (R1 example)**:
+> "Because the acquisition parameters were varied as a systematic factorial grid, a model that predicts feature reliability from those same parameters is largely recovering the grid by construction; the in-domain R² ≈ 1.0 therefore reflects design structure rather than a discovered relationship. I'd suggest reframing the Abstract and Key Points so the substantive finding is the cross-phantom/cross-scanner transportability (and its failure), and stating explicitly in the Results that a negative R² on the target domain means the model performs worse than predicting the mean — i.e., the reliability mapping does not transport."
+**Output template (R4 example)**:
+> "The reported gains come from a grid of [N models] × [M thresholds] one-sided comparisons; with [N×M] tests, roughly one positive is expected by chance alone, and the external cohort (n = [k]) yields bootstrap ΔAUC intervals that cross zero for several thresholds. I'd suggest reporting a multiplicity-adjusted analysis (or stating the expected false-positive count), restricting the headline claim to the classifier family that survives, and marking the ΔAUC intervals that cross zero in the figure."
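
Editorial sketch (not part of the package): the R4 probe asks the reviewer to name the expected number of chance positives for a model × threshold grid. The arithmetic can be written out as below. The function name and the returned keys are hypothetical, and the family-wise error rate line additionally assumes that the tests are independent and that every null hypothesis is true, which a real classifier grid on one cohort will not satisfy exactly.

```python
def multiplicity_summary(n_models, n_thresholds, alpha=0.05):
    """Summarize the multiplicity burden of an n_models x n_thresholds comparison grid."""
    m = n_models * n_thresholds
    return {
        "tests": m,
        # Expected count of p < alpha results if every null hypothesis were true.
        "expected_false_positives": m * alpha,
        # Probability of at least one false positive, assuming independent tests.
        "familywise_error_rate": 1 - (1 - alpha) ** m,
        # Per-test threshold that keeps the family-wise rate at or below alpha.
        "bonferroni_alpha": alpha / m,
    }


if __name__ == "__main__":
    # The grid named in the probe text: 5 models x 3 thresholds = 15 tests.
    for key, value in multiplicity_summary(5, 3).items():
        print(key, round(value, 4))
```

For the 5 × 3 grid in the probe text this gives 0.75 expected false positives, which is the "≈1" the probe quotes, and a chance of at least one false positive above 50% under the independence assumption.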

package/skills/peer-review/references/domain-probes/sr_ma.md ADDED

@@ -0,0 +1,87 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a task- or
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# Systematic Review / Meta-Analysis probes (P0–P10)
+Internal-consistency-first gate (P0) plus a 10-probe checklist (P1–P10). These probes complement (do not replace) the generic Phase 2 issue checklist.
+**P0 — Internal-consistency-first gate (run before P1; gates any fabrication claim)**:
+- Before alleging fabrication on a manuscript that "feels AI-generated", reproduce the headline pooled statistics, paired study counts (k), and subgroup counts directly from the extracted data table (or supplement included-studies table).
+- If paired k, pooled medians, and subgroup counts reproduce, fabrication is unlikely — **pivot the review to table-vs-source fidelity (P1), comparator definition (P1), and eligibility**, not to a fabrication framing.
+- Only if the table cannot be reproduced, or is internally inconsistent, escalate to a transparency/integrity MAJOR.
+- Rationale: an "AI-smelling" surface is not evidence of fabrication. Real references can be present and the arithmetic coherent while the substantive flaws are extraction, comparator, eligibility, and overclaiming.
+**P1 — Performance-MA value + comparator-existence probe**:
+- For method-comparison MAs reporting accuracy / DSC / AUC / F1 (model-vs-model, AI-vs-reader, two training paradigms) and for DTA MAs reporting sensitivity / specificity, select ≥2 outlier or headline-driving studies.
+- (a) Verify each sampled arm value against the source paper (PubMed abstract or full text). For DTA cells, check for **sens/spec swap** (source sens=A% / spec=B% appearing in the forest as sens=B% / spec=A%).
+- (b) **Comparator-existence check**: verify the comparator arm is consistently defined and actually exists in each source. A baseline mislabeled as the comparator inflates the headline (e.g., a limited single-source baseline reported as a "centralised" comparator when the source paper has no centralised arm).
+- (c) Per-study schema: `Exists | Correct citation | Eligible (domain-specific) | Same comparator (same task/dataset) | Value matches source | Author-derived/averaged | Verdict`.
+- (d) Severity ladder: `<1pp rounding or author-derived average = minor`; `wrong dataset/task/comparator or not domain-specific = major`; `unfindable or wrong-citation = integrity concern (verify against source); potentially major`.
+- If a confirmed error drives a reported subgroup p-value or a headline claim, register as a primary major finding.
+**P2 — Cohort / benchmark non-independence probe**:
+- Identify clusters in included studies sharing: (a) institution name, (b) author surname + year proximity, (c) public ICU/EHR database (MIMIC-IV, eICU, MIMIC-III, KNHIS, UK Biobank, Optum, MarketScan, IBM), (d) **public imaging-challenge benchmark** (BraTS, FeTS, TCIA, Kaggle) reused across multiple included studies.
+- For each cluster, fetch PubMed efetch affiliation + abstract Methods database/benchmark source.
+- Flag pairs sharing the same data source + overlapping enrollment period (or the same public benchmark) as "high-confidence non-independence".
+- Manuscript should acknowledge in Limitations + perform a leave-one-dataset-out sensitivity analysis and add a data-provenance column to Table 1. If absent → MAJOR.
+- **Nuance**: map provenance and *request* the provenance column + sensitivity analysis; do NOT assert that a specific study used a given benchmark from coarse supplement labels alone (e.g., a supplement labeling a study only as "Hospital" or "Public" does not confirm BraTS/FeTS use). Confirm against the source before stating it.
+**P3 — Diagnostic subset N transparency (mixed DTA + prognostic MA)**:
+- Compute bivariate pool denominator (TP+FP+TN+FN) from Table 2 or forest plot.
+- Compare to total N reported in Abstract.
+- If diagnostic subset is <50% of total without explicit "diagnostic subset N = X / Y" in Results → MAJOR transparency gap.
+**P4 — k=1 subgroup flag**:
+- Inspect subgroup analyses for strata with k=1 (single included study).
+- If a reported subgroup p-value is driven by k=1 stratum → flag MAJOR.
+- Recommend reframing as exploratory or removing from formal subgroup test.
+**P5 — Supplementary completeness check**:
+- SR-MA supplementary must contain at minimum:
+  - PRISMA / PRISMA-DTA checklist with page refs
+  - Full-text exclusion list with reasons (per PRISMA 2020 item 16b)
+  - Per-study data extraction table
+  - Per-study × per-domain risk-of-bias table (QUADAS-2 / QUADAS-AI / PROBAST / PROBAST-AI)
+  - Full search strategy verbatim per database
+- If supplementary contains only figure captions or is missing 3+ of these → MAJOR.
+**P6 — PROSPERO ID format + live URL request**:
+- Standard PROSPERO format: `CRD42` + 4-digit YYYY + 6-digit sequential = 13 chars total. Some pre-2020 IDs are 12 chars (5-digit sequential).
+- IDs with >13 chars or non-numeric tail → FORMAT_ANOMALY (MAJOR).
+- Always request authors provide live registration URL in cover letter for protocol cross-check.
+**P7 — Reference duplicate detection** (extends `/verify-refs`):
+- Run `/verify-refs` (PubMed + CrossRef). In addition to standard checks, detect duplicate PMID or DOI within reference list.
+- Verbatim duplicates indicate an LLM-assisted reference-compilation error → MAJOR (citation renumbering required).
+**P8 — AI Disclosure presence**:
+- `grep -iE "chatgpt|gpt-|llm|generative ai|ai was used|ai-assisted|copilot|claude|gemini|chatbot|large language model"` on manuscript body.
+- If 0 matches AND journal requires AI Disclosure (RYAI / Radiology / RSNA family / Lancet family / JAMA family / most BMJ family / Nature family) → flag MINOR-to-MAJOR.
+**P9 — Non-significant finding promoted to Abstract (overclaim probe)**:
+- Flag any exploratory or non-significant result (a crossover, a trend, a post-hoc subgroup) that appears in the Abstract or Key Points framed as a finding.
+- Sub-check: does the promoted finding depend on a study flagged or mis-extracted under P1? (A headline crossover can collapse once a mis-extracted comparator is corrected.)
+- Flag "non-inferiority" / "equivalence" asserted without a pre-specified margin. A margin cannot be pre-specified retrospectively — ask the authors to document any pre-existing protocol margin, otherwise drop the non-inferiority language or present it explicitly as a post hoc equivalence / sensitivity analysis.
+**P10 — Citation-metadata confusion class (over-escalation guard)**:
+- DOI-suffix digits that surface as an apparent article number (e.g., a DOI tail "77196" against article number 26068, or "60466-1" against 6274) are cosmetic metadata confusion, **not** fabrication — do not escalate them as fabricated references.
+- Reference-list duplicates are handled by `/verify-refs` (`duplicate_findings[]`); AI-disclosure presence is the cross-cutting P8 check. Neither is unique to SR/MA.
+**Output template (P1 cell-swap example)**:
+> "I spot-checked [Author Year] (PMID [...]) against the source paper and found that the values in Figure X are swapped. The source paper reports external-test sensitivity A% / specificity B% (n=N); the manuscript forest entries place [num1/denom1] in the sensitivity slot (which is the source's specificity numerator/denominator) and [num2/denom2] in the specificity slot (which is the source's sensitivity)."
+**Output template (P1 comparator-existence example)**:
+> "I spot-checked [Author Year] (PMID [...]) against the source. The manuscript lists this study's comparator ('[label]', [value]) in [comparison], but the source paper does not report that arm; the [value] appears to be the study's [limited single-source baseline]. Because this entry contributes to [the pooled comparison / a headline claim], I'd suggest re-extracting the comparator definition per study and adding a comparator-definition column to Table 1 so readers can confirm each arm is the same task on the same data."
+**Output template (P2 example)**:
+> "[Author1 Year1] uses [Database] (N=...). [Author2 Year2] uses [Database] (N=...). These are almost certainly overlapping patient pools, and the statistical independence assumption for MA pooling is violated. I'd suggest a sensitivity analysis excluding one of the two studies, plus an explicit cohort-source column in Table 1."
+**Discipline — leads vs findings (applies to every P0–P10 probe)**:
+- Output from a forensic sub-agent or automated scan is a **lead, never a finding, until confirmed against the source.** Concrete failure modes to discard on inspection: treating recent (in-press / current-year) publication dates as "impossible", inventing journal article-number rules, and inflated all-or-nothing fabrication-risk scores.
+- Before finalizing, run an **overclaim sweep of your own draft** (mandatory external-QC pass — independent model or colleague). Two worked examples: a strong claim that "the references are real, not fabricated" should be narrowed to "the sampled references / DOIs resolved"; a benchmark example list should be trimmed to studies whose benchmark use was source-confirmed.
+- **Do not compute chance-probabilities** for suspicious or identical values. Record the observation neutrally: "exact match to ≥2 decimals; source verification pending."
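
Editorial sketch (not part of the package): the P7 probe asks for duplicate PMIDs or DOIs within one reference list. A minimal way to surface such leads is to group reference numbers by normalized identifier, as below. This is not the `/verify-refs` implementation or its `duplicate_findings[]` output, neither of which appears in this diff. The input shape (dicts with `number`, `pmid`, `doi`) and the function names are hypothetical, and the identifiers in the example are placeholders.

```python
from collections import defaultdict

DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip a resolver or 'doi:' prefix (DOIs are case-insensitive)."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def find_duplicates(references):
    """Map ('pmid' | 'doi', identifier) -> reference numbers, for identifiers seen more than once."""
    seen = defaultdict(list)
    for ref in references:
        if ref.get("pmid"):
            seen[("pmid", str(ref["pmid"]).strip())].append(ref["number"])
        if ref.get("doi"):
            seen[("doi", normalize_doi(ref["doi"]))].append(ref["number"])
    return {key: numbers for key, numbers in seen.items() if len(numbers) > 1}


if __name__ == "__main__":
    # Placeholder identifiers, not real citations.
    refs = [
        {"number": 1, "pmid": "00000001", "doi": "10.1000/example.1"},
        {"number": 2, "doi": "https://doi.org/10.1000/EXAMPLE.1"},
        {"number": 3, "pmid": "00000002"},
    ]
    print(find_duplicates(refs))
```

In keeping with the leads-versus-findings discipline above, a hit from a scan like this is a lead to confirm against the reference list, not a finding on its own.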

package/skills/peer-review/references/domain-probes/survival_prognostic.md ADDED

@@ -0,0 +1,68 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a task- or
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# Survival / Prognostic Model probes (S1–S8)
+An 8-probe checklist for time-to-event outcomes and prognostic model development. These probes complement (do not replace) the generic Phase 2 issue checklist and may be co-applied with the SR-MA probes for a meta-analysis of prognostic models.
+**S1 — Conditioning / causal framing**:
+- Does the manuscript claim a "preoperative" / "screening" / "triage" / "X replaces Y" use case while outcomes are conditioned on the downstream treatment whose value the model is supposed to inform?
+- Inputs include post-decision variables (resection margin status, adjuvant chemo/radiotherapy, transplant status) that are unknown at the claimed decision point?
+- Non-treatment comparator or causal framework present?
+- Conditioning gap → MAJOR candidate. Recommend retrain without leaky variables / add non-treatment arm / reframe intended use.
+- **Time origin & survivorship** (incident / transition models): is the at-risk clock started at the correct origin for each incident model, with immortal time (a span in which the event cannot occur, misattributed to one group) and left-truncation / delayed entry handled? Is a "progressor" / transition label conditioned on *surviving to* a later ascertainment (a second scan, a follow-up visit) — a survivorship that needs a landmark time or an explicit intermediate-state model? If the primary analysis is **not** the full cohort (e.g., complete-case while a large fraction is missing) and the complete-case model is the significant one, that selection needs a stated justification and a MAR rationale — an outcome-dependent choice of the analysis set is the S8 concern. Any of these unhandled → MAJOR.
+- **Self-confession escalation**: a Methods or Limitations admission that a time-origin, immortal-time, return-conditioning, or selection issue was *"not formally assessed"* (or equivalent) is itself a MAJOR — it names a known bias that was left unaddressed, not a mitigated limitation.
+**S2 — Censoring handling in training loss**:
+- Cox partial-likelihood loss or DeepSurv-style loss specified? How is censoring handled (right-censoring, interval-censoring, informative censoring by death)?
+- If Methods describe a Cox or partial-likelihood loss but do not specify censoring treatment, register as MAJOR (reproducibility).
+- Covariate incompleteness in the model fit is part of the same disclosure: is a structural-zero covariate (a never-smoker's pack-years = 0 by definition, not missing) handled as a zero, or dropped under complete-case so the unexposed stratum and the events-per-variable silently collapse? An undisclosed complete-case collapse from a dose/duration covariate is a reproducibility + power issue — adjust on the categorical status and reserve the continuous dose for an exposed-only secondary analysis.
+**S3 — Competing risks**:
+- 2+ event types (local recurrence + distant metastasis + death, or cause-specific mortality) modeled?
+- Cause-specific hazards or Fine-Gray subdistribution hazards used?
+- Patient developing one event still at risk for the other (informative censoring by death)?
+- If competing-risks structure is ignored and outcomes are treated as independent right-censored events → MAJOR.
+**S4 — Cutoff derivation optimism**:
+- Cutoffs derived via maximally selected log-rank statistics, AUC-based Youden's J, or similar data-driven methods?
+- Hothorn-Lausen correction or equivalent optimism correction applied?
+- Was the same cohort used for both model selection (hyperparameter tuning) AND cutoff selection? (Optimism bias)
+- Bootstrap optimism estimate or sensitivity analysis on cutoff choice (e.g., ±0.5 SD perturbation)?
+- Same-cohort dual use without correction → MAJOR.
+**S5 — Comparator horizon alignment**:
+- External baseline prognostic nomogram (commonly designed for 5- or 10-year endpoints) applied as the comparator?
+- Manuscript's available follow-up duration aligned with that horizon?
+- Mismatch → baseline C-index degradation may reflect design-horizon mismatch ≠ intrinsic inferiority. Recommend time-dependent C-index or time-stratified analyses.
+- Baseline implementation specified: applied as published, locally recalibrated, or refit as a new Cox model with similar variables?
+- Unclear implementation → MAJOR (a refit local model should be described as a clinicopathologic comparator, not a "guideline model").
+**S6 — C-index variant + reverse Kaplan-Meier follow-up**:
+- Which C-index variant: Harrell's C, Uno's C, time-dependent AUC, IPCW-C?
+- Variant appropriate for the censoring distribution and sample size?
+- Time-dependent AUC at a clinically anchored horizon (e.g., 2-year, 3-year) reported alongside Harrell's C?
+- Reverse Kaplan-Meier median follow-up reported per cohort and per outcome (LR vs DM separately) with censoring date?
+**S7 — Calibration beyond discrimination**:
+- Calibration plot (intercept / slope) across all cohorts?
+- Brier score / Integrated Brier Score (IBS)?
+- Decision-curve analysis at clinically relevant probability thresholds?
+- For a prognostic model intended to guide surveillance intensity, treatment intensification, or eligibility for adjuvant therapy, discrimination alone is insufficient. If Methods mention calibration but Results/supplement contain no calibration plot or numeric metrics → MAJOR.
+**S8 — Estimand provenance**:
+- Is the survival estimand stated explicitly and held consistent across Abstract / Methods / Results — event-free survival, cause-specific cumulative incidence, all-cause mortality — and at the subject vs population level? A subdistribution hazard (Fine-Gray) answers a different question than a cause-specific hazard; quoting an sHR for an etiologic claim, or a cause-specific HR for an absolute-risk claim, is an estimand mismatch.
+- Is the evaluation horizon (2-/3-/5-year) and the primary model fixed in advance and consistent with the registered/pre-specified primary endpoint, or was the primary endpoint, model, or horizon re-designated after the results were known (outcome-dependent primary selection)?
+- Does every derived statistic (E-value, an sHR-vs-cause-specific-HR contrast) trace to the *declared primary* estimand, or is a supporting/non-primary estimate quoted as if it bounded the headline claim?
+- Estimand drift — a primary re-designated post-hoc, or a derived statistic computed on a non-primary estimate but presented as primary → MAJOR. Recommend reporting the pre-specified and revised models coequally, disclosing the change, and recomputing any E-value for the primary estimate. (The self-review skill automates the registration ↔ manuscript and E-value arithmetic checks as Phase 2.5f, `scripts/check_claim_artifact.py`.)
+**Output template (S4 example)**:
+> "The Methods (p. X) state that optimal cutoffs for [outcome] were determined via maximally selected log-rank statistics on the internal validation cohort. Two concerns: (a) Hothorn-Lausen correction is cited but it is unclear whether the corrected p-value was used in the cutoff selection; (b) the internal validation cohort appears to have been used for both model selection and cutoff selection, which is a known source of optimism. I'd suggest reporting bootstrap-based optimism estimates or a sensitivity analysis showing how external performance shifts under ±0.5-SD perturbation of the chosen cutoff."
+**Output template (S5 example)**:
+> "The chosen baseline nomogram was originally designed and validated for prediction of long-horizon endpoints (5- and 10-year). In this study, median follow-up in [external cohort] is substantially shorter than that horizon, so the comparator's apparent underperformance may partly reflect a horizon mismatch rather than intrinsic inferiority. I'd suggest (a) stating explicitly the time horizon at which both models were evaluated, (b) reporting time-dependent C-indices at a clinically anchored horizon, and (c) clarifying whether the comparator was applied as published, recalibrated locally, or refit as a new Cox model with similar variables."
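
Editorial sketch (not part of the package): the S6 probe asks which C-index variant was used and whether it suits the censoring distribution. A plain O(n²) version of Harrell's C makes the dependence visible: a pair counts only when the subject with the shorter time had an observed event, so heavy censoring changes which pairs are compared. The function name and argument layout are hypothetical, and pairs with tied times are simply skipped, which is one of several conventions.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    times: follow-up time per subject.
    events: 1 if the event was observed, 0 if censored.
    risk_scores: model output where a higher score means higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data: the third subject is censored at t = 3.
    print(harrell_c([1, 2, 3, 4], [1, 1, 0, 1], [0.9, 0.7, 0.4, 0.1]))
```

Because the set of comparable pairs depends on who was censored and when, Harrell's C can shift with the censoring pattern. That is the reason the probe also asks about Uno's C or an IPCW-based variant and about a time-dependent AUC at a fixed horizon.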

package/skills/peer-review/references/exemplar_reviews/README.md ADDED

@@ -0,0 +1,43 @@
+# Exemplar reviewer comments — anchoring + phrasing models
+The skill teaches *what* to look for (signature checks + domain probes) and *what tone*
+to use (`aczel_2021_reviewer2_patterns.md`), but until now carried no worked examples of
+how an experienced reviewer turns a finding into a comment. This directory fills that gap.
+Each file models one recurring finding for medical-AI / clinical papers and shows the
+same four moves a strong review makes:
+1. **Anchor** — name the exact location (section, figure/table, page) the concern sits in.
+2. **State the gap** — what is claimed vs what the evidence supports, concretely.
+3. **Phrase it as a partner** — hedged, first-person, critique-the-work-not-the-author
+   (Aczel-compliant: "I'd suggest…", "it would help to…", never "the authors fail to…").
+4. **Calibrate severity** — when the finding is design-level it becomes Major #1; when it
+   is fixable-as-reported it stays a Minor; the example says which and why.
+## How these are used
+The reviewer (or `/self-review`) reads the relevant exemplar before drafting Phase 3
+comments, to model the anchoring + phrasing — not to copy text. They are **teaching
+models, authored from scratch**, not extracted from any published review.
+## Contents
+- `ai_overclaiming.md` — generalizability / "outperforms" / "can replace" claims that
+  outrun single-center or single-reader evidence.
+- `reference_standard_validity.md` — an imprecise, unblinded, or mistimed reference
+  standard in a diagnostic-accuracy study.
+- `data_leakage.md` — patient-level split violations and label-bearing input features.
+- `calibration_missing.md` — discrimination (AUC) reported for a clinical prediction
+  model with no calibration assessment.
+## Curator guidelines (for adding more)
+- **Synthetic only.** Author the example; never paste a real reviewer's words or a real
+  manuscript's text. Use placeholder study details ("a single-center retrospective
+  cohort", "Figure 3").
+- **One finding per file**, showing the four moves above end to end.
+- **Show both the weak and the strong phrasing** so the contrast is teachable.
+- **Tie severity to the fatal-flaw hierarchy** the skill already uses (design-level →
+  Major #1; reporting-level → Minor).
+- Keep each file ~40–80 lines. Cross-reference the relevant domain probe or signature
+  check by name, not by copying it.

package/skills/peer-review/references/exemplar_reviews/ai_overclaiming.md ADDED

@@ -0,0 +1,47 @@
+# Exemplar — AI overclaiming relative to the evidence
+**Finding class:** the conclusion's reach (generalizable / outperforms / can replace)
+exceeds what a single-center, single-reader, or internally-validated result supports.
+**Typical severity:** design-/framing-level → **Major #1** when the headline claim depends
+on it; Minor when it is only a stray adjective in the Discussion.
+## What the reviewer noticed
+The Abstract and Conclusion state the model "generalizes across institutions" and
+"outperforms radiologists," but the external test set is one site (Methods, "External
+validation"), the reader comparison used two readers working at a different task tempo
+from the model (Table 3), and the confidence intervals for model vs reader AUC overlap (Figure 2).
+## Weak phrasing (avoid)
+> The authors overclaim. Saying the model generalizes and beats radiologists is not
+> justified and should be removed.
+(Verdict without an anchor, no path forward, gatekeeper tone.)
+## Strong phrasing (model this)
+> The Conclusion states the model "generalizes across institutions," but external
+> validation appears limited to a single site (Methods, *External validation*). I'd
+> suggest softening this to the evidence — e.g., "validated at one external site" — and
+> framing multi-institution generalizability as a stated limitation and next step.
+>
+> Relatedly, the "outperforms radiologists" claim rests on a comparison whose 95% CIs for
+> model and reader AUC overlap (Figure 2), and the reader task differs from the model's in
+> [tempo/inputs] (Table 3). It would strengthen the paper to (a) report the difference in
+> AUC with its CI and a test of that difference rather than two separate AUCs, and (b)
+> state explicitly which clinical task the comparison establishes. If the difference is
+> not statistically supported, I'd recommend reframing from "outperforms" to
+> "comparable to," which is still a meaningful and more defensible result.
+## Why this is Major #1 here
+The over-reach is in the Abstract and Conclusion and is the paper's headline, so a reader
+takes away a claim the data do not support. Anchoring it to the single-site external set
+and the overlapping CIs makes the fix concrete and keeps the contribution intact.
+## Related checks
+Signature check "Overclaiming vs evidence level"; self-review category D
+(endpoint↔conclusion scope); `check_scope_coherence.py` for surrogate→care-directive and
+cross-sectional→prognostic variants.
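The request to report "the difference in AUC with its CI" rather than two separate AUCs can be sketched as below. This is one possible approach, a paired percentile bootstrap over cases, and not necessarily what a given manuscript should use (DeLong's method is a common alternative). All data are synthetic and the function names are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def paired_auc_difference(model, reader, labels, n_boot=2000, seed=0):
    """Point estimate and percentile 95% CI for AUC(model) - AUC(reader) on the same cases."""
    rng = random.Random(seed)
    n = len(labels)
    observed = auc(model, labels) - auc(reader, labels)
    diffs = []
    while len(diffs) < n_boot:
        # Resample cases, keeping model and reader scores paired per case.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUC is undefined when a resample contains one class only
        m = [model[i] for i in idx]
        r = [reader[i] for i in idx]
        diffs.append(auc(m, y) - auc(r, y))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return observed, lower, upper


# Synthetic example: 20 diseased and 20 non-diseased cases scored by model and reader.
data_rng = random.Random(1)
labels = [1] * 20 + [0] * 20
model = [data_rng.gauss(1.0, 1.0) if y else data_rng.gauss(0.0, 1.0) for y in labels]
reader = [data_rng.gauss(0.8, 1.0) if y else data_rng.gauss(0.0, 1.0) for y in labels]

diff, lower, upper = paired_auc_difference(model, reader, labels, n_boot=500)
print(f"AUC difference {diff:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

If the interval for the difference spans zero, "comparable to" is the reading the data support, which is the reframing the exemplar comment proposes.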

package/skills/peer-review/references/exemplar_reviews/calibration_missing.md ADDED

@@ -0,0 +1,44 @@
+# Exemplar — calibration not reported for a clinical prediction model
+**Finding class:** the model is presented for clinical decision-making, but only
+discrimination (AUC/c-statistic) is reported — no calibration, and often no decision-curve
+or clinical-utility analysis.
+**Typical severity:** **Major** when the paper proposes clinical use (a probability that
+drives a decision must be calibrated); **Minor** when the model is framed as a research
+prototype only.
+## What the reviewer noticed
+Results report AUC with CIs for every model (Table 2, Figure 1), and the Discussion
+proposes using predicted probabilities to triage patients, but there is no calibration plot
+or metric (calibration slope/intercept, ECE) and no decision-curve analysis.
+## Weak phrasing (avoid)
+> AUC is not enough; the authors need calibration.
+(True but terse; no reason, no anchor, no path.)
+## Strong phrasing (model this)
+> The discrimination results are clearly presented (Table 2, Figure 1). Because the
+> Discussion proposes using the predicted probabilities to triage patients, it would help
+> to add an assessment of calibration — how close predicted probabilities are to observed
+> frequencies. AUC can be high while probabilities are systematically too confident, which
+> would mislead a probability-based threshold. A calibration plot with slope and intercept,
+> ideally on the external set as well, would let readers judge whether the proposed thresholds
+> are safe. A decision-curve analysis would further show the net benefit across plausible
+> threshold probabilities relative to treat-all / treat-none. If recalibration was applied,
+> stating the method would also be useful.
+## Severity calibration
+If the paper's contribution is a deployable triage tool, calibration is **Major** — the
+clinical claim rests on the probabilities being trustworthy. If the model is explicitly a
+methods demonstration with no decision claim, a calibration metric is a reasonable
+**Minor** addition.
+## Related checks
+Signature check "Calibration (AUC alone insufficient)"; TRIPOD+AI / CLAIM calibration items
+via `/check-reporting`; self-review category C (calibration [CRITICAL]).
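The point that AUC can be high while probabilities are too confident can be illustrated with a small sketch. It assumes equal-width bins for the expected calibration error (ECE), which is one common definition among several, and uses a made-up cohort; the function names are hypothetical.

```python
def auc(probs, labels):
    """AUC as the probability that a positive case outranks a negative one (ties count half)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def expected_calibration_error(probs, labels, n_bins=5):
    """Equal-width-bin ECE: size-weighted mean of |observed event rate - mean predicted|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(observed - mean_pred)
    return ece


# Hypothetical cohort: 10 patients with 2 events, then 10 patients with 8 events.
labels = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
calibrated = [0.2] * 10 + [0.8] * 10
overconfident = [0.02] * 10 + [0.98] * 10  # same ranking, more extreme probabilities

print(auc(calibrated, labels), auc(overconfident, labels))          # 0.8 0.8
print(round(expected_calibration_error(calibrated, labels), 2))     # 0.0
print(round(expected_calibration_error(overconfident, labels), 2))  # 0.18
```

Both score sets rank patients identically, so discrimination cannot tell them apart; only the calibration metric shows that the second set would mislead a probability-based triage threshold.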

package/skills/peer-review/references/exemplar_reviews/data_leakage.md ADDED

@@ -0,0 +1,48 @@
+# Exemplar — data leakage (split contamination / label-bearing inputs)
+**Finding class:** information from the test set or the outcome reaches the model at
+training time — via a patient appearing in both train and test, or via an input feature
+that encodes the label.
+**Typical severity:** design-level → **Major #1**, because leakage inflates every reported
+metric and is often invisible in the results themselves.
+## What the reviewer noticed
+Methods report a 70/30 split "by image," but many patients contributed multiple studies
+(Table 1, mean 2.4 studies/patient), so the same patient can fall on both sides of the
+split. Separately, one input feature is the radiology report impression text, while the
+outcome is the report-derived diagnosis — the input may contain the label.
+## Weak phrasing (avoid)
+> There is leakage so the model is invalid.
+(Asserts the conclusion; gives the authors nothing to act on.)
+## Strong phrasing (model this)
+> Two points about independence between training and evaluation that, if addressed, would
+> considerably strengthen confidence in the metrics:
+>
+> First, the 70/30 split is described as by image (Methods, *Data split*), but patients
+> contributed multiple studies (Table 1). A patient-level split prevents the model from
+> seeing the same patient in training and test; an image-level split can inflate
+> performance. I'd suggest re-splitting by patient and reporting whether the metrics change
+> — if they are stable, that is a strong robustness result to include.
+>
+> Second, one input is the report impression text while the outcome is derived from the
+> report (Methods, *Inputs* / *Outcome*). If the impression states the diagnosis, the model
+> may be reading the label rather than predicting it. Could the authors either mask the
+> diagnosis-bearing portion of the text, or report a sensitivity analysis with that feature
+> removed, to show the result does not depend on it?
+## Why this is Major #1
+Leakage does not announce itself in the results — the metrics look excellent precisely
+because of it. Anchoring to the multi-study patients (Table 1) and the report-derived
+outcome makes the concern concrete and gives two specific, runnable remedies.
+## Related checks
+Signature check "Patient-level data splitting"; self-review categories A (independence/
+leakage) and H (circularity / label-feature overlap).
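The re-split by patient suggested in the exemplar can be sketched as follows. This is a minimal standard-library illustration, assuming each record carries a patient identifier; the record layout, identifiers, and function names are hypothetical.

```python
import random


def patient_level_split(records, test_fraction=0.3, seed=0):
    """Split (patient_id, image_id) records so that no patient spans train and test."""
    patients = sorted({pid for pid, _ in records})
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    test_ids = set(patients[:n_test])
    train = [rec for rec in records if rec[0] not in test_ids]
    test = [rec for rec in records if rec[0] in test_ids]
    return train, test


def shared_patients(train, test):
    """Patients that appear on both sides of a split."""
    return {pid for pid, _ in train} & {pid for pid, _ in test}


# Hypothetical dataset: 10 patients contributing 1 to 3 images each (19 images).
records = [
    (f"P{p:02d}", f"P{p:02d}_img{k}") for p in range(10) for k in range(1 + p % 3)
]

# Image-level split, as described in the exemplar manuscript's Methods.
shuffled = records[:]
random.Random(0).shuffle(shuffled)
cut = round(0.7 * len(shuffled))
naive_train, naive_test = shuffled[:cut], shuffled[cut:]

# Patient-level split: whole patients are assigned to one side.
train, test = patient_level_split(records)

print("patients on both sides (image-level):", len(shared_patients(naive_train, naive_test)))
print("patients on both sides (patient-level):", len(shared_patients(train, test)))  # 0 by construction
```

Reporting the metrics under both splits is what lets readers see whether the image-level result depended on patient overlap.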

package/skills/peer-review/references/exemplar_reviews/reference_standard_validity.md ADDED

@@ -0,0 +1,45 @@
+# Exemplar — reference-standard validity in a diagnostic-accuracy study
+**Finding class:** the reference standard ("ground truth") is imprecisely defined, applied
+non-independently of the index test, mistimed, or read without blinding — so the reported
+sensitivity/specificity may be biased.
+**Typical severity:** design-level → **Major** (often Major #1 for a DTA paper), because it
+threatens the headline accuracy estimates.
+## What the reviewer noticed
+Methods describe the reference standard as "clinical consensus," but it is not stated who
+adjudicated, whether they were blinded to the index test, or the time interval between
+index test and reference. STARD items on the reference standard and blinding are not
+addressed.
+## Weak phrasing (avoid)
+> The ground truth is unreliable, so the results cannot be trusted.
+(Global dismissal, no anchor, no remedy.)
+## Strong phrasing (model this)
+> The reference standard is described as "clinical consensus" (Methods, *Reference
+> standard*), but a few details would let readers judge the accuracy estimates. Could the
+> authors specify: (1) who adjudicated and their expertise; (2) whether adjudicators were
+> blinded to the index-test result — this matters because unblinded adjudication can pull
+> the reference toward the index test and inflate agreement; (3) the interval between the
+> index test and the reference, since a long gap allows disease progression/regression
+> (verification timing); and (4) for cases where consensus was used, the inter-rater
+> agreement before consensus. If blinding was not feasible, a sentence on the expected
+> direction of bias and a sensitivity analysis on the subset with histologic confirmation
+> would reassure readers.
+## Severity calibration
+If adjudication was unblinded and the reference partly incorporates the index test, this is
+incorporation bias and belongs as **Major #1** — the sensitivity/specificity are the
+paper's product. If only the *reporting* is thin (the design is sound but undescribed),
+it is a **Minor** request for Methods detail.
+## Related checks
+Signature check "Reference standard precisely defined"; QUADAS-2 domains (reference
+standard, flow & timing) via `/check-reporting`; self-review category B.
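The "inter-rater agreement before consensus" requested in point (4) is often summarized with Cohen's kappa for two adjudicators. The sketch below assumes that statistic and two raters with categorical reads; the ratings are invented and the function name is hypothetical.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    if expected == 1:
        return math.nan  # undefined when both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pre-consensus reads for 10 cases (1 = disease present, 0 = absent).
adjudicator_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
adjudicator_2 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(round(cohens_kappa(adjudicator_1, adjudicator_2), 2))  # 0.6
```

Here raw agreement is 80%, but half of that would be expected by chance given each rater's marginal frequencies, which is why the chance-corrected figure is the more informative one to request.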

package/skills/peer-review/references/narrative_review_audit.md ADDED

@@ -0,0 +1,67 @@
+# Narrative / Review-Article Audit — Reference Material
+Supporting material for the Phase 2D (RV1–RV8) review-article audit in `SKILL.md`. It collects (1) the SANRA appraisal items, (2) a consolidated evaluation checklist, and (3) a candidate-additions list for AI/LLM-in-radiology reviews. Nothing here is a hard requirement; it is an appraisal aid for a *narrative* (non-systematic) review.
+## 1. SANRA appraisal items (paraphrased)
+SANRA (Scale for the Assessment of Narrative Review Articles) is a brief **critical-appraisal tool, not a reporting guideline** — it scores narrative-review quality and is not meant to be enforced like PRISMA. Six items, each scored 0–2 (paraphrased; see source for exact wording):
+1. **Importance** — the review explains why the topic matters to the readership.
+2. **Aims** — the review states its aims/questions.
+3. **Literature search** — the review describes how sources were identified (databases, terms, time window). For a narrative review this is a transparency *suggestion*, not a systematic-search requirement.
+4. **Referencing** — claims are supported by appropriate, accurate, primary-source citations.
+5. **Scientific reasoning** — evidence is appraised and synthesized critically rather than selectively reported.
+6. **Endpoint data** — relevant data are presented appropriately (tables, figures, summary elements).
+Source: Mertens S, Goldbeck-Wood S, Baethge C. *SANRA — a scale for the quality assessment of narrative review articles.* Research Integrity and Peer Review (2019). https://doi.org/10.1186/s41073-019-0064-8
+> Mapping to Phase 2D probes: RV2 ↔ items 1–2; RV3 ↔ item 3 (suggestion-level); RV6 ↔ items 4–5. RV1 (novelty) and RV7 (load-bearing figures/tables) are **editorial value-add axes**, not SANRA items — keep them separate so SANRA is not over-applied.
+## 2. Consolidated evaluation checklist
+Structural / appraisal (SANRA-aligned):
+1. Importance and knowledge gap are established.
+2. Aims and scope boundaries are explicit (what is in/out).
+3. Evidence-gathering is transparent enough to judge balance (suggestion for narrative reviews).
+4. Citations are accurate and primary-source-weighted.
+5. Evidence is appraised critically; conflicting findings are handled fairly.
+6. Key data are summarized in tables/figures/summary elements.
+Content / rigor:
+7. Coverage is comprehensive and current for the field.
+8. Treatment is balanced and objective (no cherry-picking toward a predetermined conclusion).
+9. Novelty / value-add over existing reviews is articulated.
+10. Findings are attributed to primary sources (no propagation of secondary-source errors).
+Accessibility / utility:
+11. Clear for the target audience; jargon is explained.
+12. Actionable takeaways (clinical implications, future directions).
+13. Logical organization and flow.
+Domain-specific (AI/LLM reviews):
+14. Technical and medical claims are accurate.
+15. Relevant frameworks/standards are acknowledged where they support the argument.
+## 3. Candidate additions for AI/LLM-in-radiology reviews
+These are **candidate additions** a reviewer may raise with "consider adding X because it directly supports Y" phrasing — never "must cite," and proportionate to length. Items are tiered by publication status so a preprint is not equated with a peer-reviewed guideline. Verify each against the manuscript's existing coverage before suggesting it (do not request additions already present).
+**Reporting frameworks (peer-reviewed):**
+- TRIPOD-LLM — reporting for studies using large language models (Nature Medicine).
+- MI-CLAIM-GEN — minimum-information checklist for generative clinical AI (Nature Medicine).
+- STARD-AI — diagnostic-accuracy reporting for AI (Nature Medicine).
+- CLAIM (2024 update) — checklist for AI in medical imaging (Radiology: Artificial Intelligence).
+**Reporting frameworks (preprint — label as such):**
+- Treat any not-yet-peer-reviewed arXiv/medRxiv checklist as a preprint and label it as such; do not equate it with a peer-reviewed guideline. Verify status before citing, since such items are frequently published later (for example, MI-CLAIM-GEN, listed above among the peer-reviewed frameworks, first appeared on arXiv and was subsequently published in Nature Medicine).
+**Concepts and tooling commonly expected in a hallucination-in-radiology review:**
+- Hallucination taxonomy: intrinsic vs extrinsic; faithfulness vs factuality.
+- Retrieval-augmented generation (RAG): shifts the dominant failure mode from fabrication toward retrieval error rather than eliminating hallucination.
+- Uncertainty / confidence calibration for clinical decision support.
+- Radiology-specific evaluation: report-level metrics (e.g., RadGraph, CheXbert/CheXpert-F1) and hallucination-detection work for generated reports.
+- Regulatory and deployment context: FDA 510(k) / CE pathways; deployment-assessment rubrics (e.g., RADAR).
+## Principle
+For a narrative review, error-spotting (RV4/RV5/RV6) is necessary but not sufficient. Identifying thematic gaps and proportionately suggesting missing literature/topics (RV8) is an expected part of the reviewer's role — the inverse of the scope-creep restraint applied to original research. New *studies* are still not requested; only missing *literature/topics* are.

package/skills/peer-review/references/reviewer_calibration/README.md ADDED

@@ -0,0 +1,34 @@
+# Reviewer calibration — turning a compliance % into a judgment
+`/check-reporting` reports how many guideline items are PRESENT / PARTIAL / MISSING and a
+compliance percentage. A percentage alone does not answer the reviewer's actual question:
+**is this manuscript reporting-complete enough, and which gaps are serious?** This directory
+holds that judgment layer.
+It is deliberately **not** in `reviewer_profiles/` — that directory is "form fields, not
+opinions" (the scorecard a journal's editorial system shows). Calibration is opinion, so it
+lives here, as the reviewer's own guideline.
+## What it provides
+- `compliance_floor.md` — the principle that **critical-item presence outranks the overall
+  percentage**, a per-guideline list of items that are reject-risk when MISSING regardless
+  of the headline %, and the desk-rejection-risk signals to weigh.
+## How it is used
+In `/peer-review` Phase 2 (reporting-guideline check) and `/self-review` (category G), after
+`/check-reporting` produces its table: don't stop at the percentage. Check that each
+**critical item** for the study type is PRESENT; a 90%-compliant manuscript that is missing
+a critical item is weaker than an 80%-compliant one that has them all.
+## Boundaries (read before adding)
+- **No fabricated thresholds.** This file does not assert "journal X desk-rejects below
+  Y%." Journals rarely publish a numeric floor. The *only* hard signals are (a) a missing
+  critical item and (b) the journal's own stated required elements (which live in
+  `reviewer_profiles/` and the author guidelines — verify there, do not invent).
+- **Critical-item lists are methodological judgment**, grounded in the guideline's own item
+  set (public), not in any single manuscript or review.
+- Keep it general and study-type-keyed; per-journal specifics belong in `reviewer_profiles/`
+  and are cited, not duplicated here.
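The principle that critical-item presence outranks the overall percentage can be sketched as a small comparison. This is only an illustration: the item identifiers and the critical-item set are placeholders, and counting only PRESENT items toward the percentage is an assumption, since the way `/check-reporting` weights PARTIAL items is not stated here.

```python
def reporting_judgment(item_status, critical_items):
    """Return (compliance %, critical items that are not PRESENT)."""
    present = sum(status == "PRESENT" for status in item_status.values())
    pct = 100 * present / len(item_status)
    missing_critical = sorted(
        item for item in critical_items if item_status.get(item) != "PRESENT"
    )
    return pct, missing_critical


def rank_key(item_status, critical_items):
    """Sort key: having every critical item comes first, the percentage second."""
    pct, missing_critical = reporting_judgment(item_status, critical_items)
    return (len(missing_critical) == 0, pct)


# Placeholder checklist of 10 items, two of which are treated as critical.
critical = {"item_03", "item_07"}

manuscript_a = {f"item_{i:02d}": "PRESENT" for i in range(1, 11)}
manuscript_a["item_03"] = "MISSING"   # 90% compliant, one critical gap

manuscript_b = dict.fromkeys(manuscript_a, "PRESENT")
manuscript_b["item_01"] = "MISSING"
manuscript_b["item_02"] = "PARTIAL"   # 80% compliant, critical items intact

print(reporting_judgment(manuscript_a, critical))  # (90.0, ['item_03'])
print(reporting_judgment(manuscript_b, critical))  # (80.0, [])
print(rank_key(manuscript_b, critical) > rank_key(manuscript_a, critical))  # True
```

The ordering deliberately ignores the headline percentage until the critical items are settled, which mirrors the 90% versus 80% comparison in the text.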