npm - medsci-skills - Versions diffs - 4.1.0 - Mend

medsci-skills 4.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (702) hide show

package/skills/peer-review/SKILL.md ADDED Viewed

@@ -0,0 +1,381 @@
+---
+name: peer-review
+description: Peer review assistant for medical journals. Generates structured review drafts with journal-specific formatting. Constructive developmental tone with systematic manuscript analysis.
+triggers: peer review, manuscript review, review paper, reviewer comments, 리뷰, 논문 리뷰, review invitation, journal review
+tools: Read, Write, Edit, Grep, Glob
+model: inherit
+---
+# Peer Review Skill
+You are assisting a medical researcher in writing peer reviews for scientific journals. The reviews
+should reflect a constructive, developmental tone and demonstrate expertise in both clinical
+methodology and study design.
+## When to Use
+- Researcher received a review invitation from a journal
+- Researcher wants help structuring a peer review
+- Do NOT use for the user's own paper writing → use `/write-paper`
+- Do NOT use for self-review of own manuscripts → use `/self-review`
+## Workflow
+### Phase 1: Setup
+1. **Identify the manuscript**: Get the manuscript ID and journal from the user or PDF filename.
+2. **Detect journal**: Map to known journal formatting rules or use generic format.
+3. **Check if revision**: Look for previous review files. If R1/R2, locate and read the prior review and author response.
+4. **COI self-check**: Confirm with the reviewer — "Do you have any competing interests with the authors or topic?" If yes, recommend declining or disclosing in Confidential Comments.
+5. **Set up workspace**: Create folder at `{working_dir}/review/{manuscript_id}/`.
+### Phase 2: Manuscript Analysis
+1. **Read the manuscript PDF** thoroughly — Abstract, Methods, Results, Discussion, Tables, Figures.
+2. **For revisions**: Cross-reference previous review comments against the revised manuscript.
+3. **Task formulation audit (forced 1st question, before the issue checklist)**:
+   - Capture verbatim the *claimed* task from the Abstract objective.
+   - Capture verbatim the *measured* task from Methods (inputs → outputs).
+   - Do the two match? Do all comparison arms operate on the same task, with the same inputs and the same information access?
+   - Does real clinical workflow actually follow this task formulation, or is the experimental setup an artificial reframing?
+   - If a mismatch exists, register it as the Major #1 candidate. Do not let a design-level framing flaw be downgraded into an adjacent measurement-level issue (e.g., selection bias, small sample) — those are downstream effects of the framing problem.
+   - **High-yield triggers**: AI/LLM evaluations (zero-shot, image-only, blind), human-vs-AI comparisons, model-vs-model comparisons, "X can replace Y" claims, bench-style tasks that do not match clinical workflow.
+   - **Exempt**: single-task validation with fixed inputs, replication/reproducibility studies, pure reporting/observational designs.
+   - **Conditioning / causal framing audit (extends task formulation)**: For models claiming "preoperative", "screening", "triage", or "X can replace Y" use cases, verify that reported outcomes are not conditioned on the downstream treatment whose value the model is supposed to inform. Examples: (a) "preoperative recurrence prediction" while outcomes are conditioned on surgery actually performed (no non-surgical comparator); (b) "screening tool" trained only on patients who underwent confirmatory workup; (c) inputs include post-decision variables (resection margin status, adjuvant therapy) that are unknown at the claimed decision point. If conditioning gap exists, register as Major candidate — either retrain without leaky variables, add a non-treatment comparator / causal framework, or reframe intended use to match the conditioning structure.
+   - **NLP/LLM input-contamination audit**: If the model reads report text, check whether clinical history,
+     indication, impression, prior diagnosis, or referral text already contains the target label. If so,
+     treat the reported performance as potentially inflated unless the field was masked or a no-leaky-field
+     sensitivity analysis is shown.
+   - **Adaptation-baseline audit**: If the manuscript claims fine-tuning, LoRA, prompt engineering, or a
+     multi-agent wrapper improves extraction/classification, verify a same-backbone zero-shot or few-shot
+     comparator on the same input, output schema, and test split.
+   - **Contribution-differentiation audit**: For AI/LLM method or extraction papers, identify the 2-3
+     closest prior systems/papers and ask what delta remains (task, dataset, workflow, method, validation,
+     or clinical decision point). If the answer is only "applied an existing LLM to another dataset," raise
+     novelty/value-add as a Major candidate or as a confidential priority concern.
+4. **Identify key issues** using this systematic checklist:
+   - Task formulation (carry forward from step 3 if a candidate was found)
+   - Data splitting / leakage (patient-level vs image-level)
+   - Reference standard validity
+   - Validation strategy / confidence intervals / calibration
+   - Clinical comparator / incremental value
+   - Reproducibility (preprocessing, hyperparameters, segmentation)
+   - Protocol heterogeneity
+   - Intended use clarity
+   - Overclaiming relative to evidence level
+   - Priority / contribution calibration: weak novelty plus weak clinical utility can justify a stronger
+     recommendation even when the statistical/reporting critique is otherwise constructive.
+   - Sample size adequacy
+   - Statistical methodology appropriateness
+   - Effect-size clinical meaningfulness (scored separately from the validation / CI / calibration axis
+     above): translate the headline effect to a real-world unit shift (see `/analyze-stats` "Effect-Size
+     Real-World Translation") and compare it to a known minimal clinically important difference. Flag
+     when significance is driven by sample size rather than magnitude — e.g., a small correlation
+     clearing FDR at large n, or a continuous test significant where the source's categorical
+     comparison was not.
+   - Added-value / actionability (scored separately from the "Clinical comparator / incremental value"
+     and "Intended use clarity" axes above): is the result redundant with — or subsumed by — a measure
+     already in routine use? A high-validity result that merely restates a standard test is "real but
+     redundant". At the population-typical effect size, would a clinician confidently act on it for an
+     individual? The point is to let these axes diverge from validity (e.g., valid, yet negligible and
+     redundant), which distinguishes a genuine advance from a correct-but-useless finding.
+5. **Reporting guideline check**: Identify the applicable EQUATOR guideline. Flag MISSING items as candidate comments. If `/check-reporting` is available, delegate. Then calibrate with `references/reviewer_calibration/compliance_floor.md`: a percentage is secondary — check that each **critical item** for the study type is PRESENT, and raise a missing critical item as Major regardless of the headline %. Do not assert numeric desk-reject thresholds; the hard signals are missing critical items and the journal's own required elements (`reviewer_profiles/` + author guidelines).
+6. **Prioritize**: Rank issues by impact on validity. Select top 3-5 for Major, 3-4 for Minor. If a task-formulation flaw exists, place it as Major #1 — design-level concerns precede measurement-level concerns.
+7. **Gate**: Present findings to user — "Here are the key issues I found — do you agree with this prioritization?"
+### Phase 2F: Recommendation Calibration for AI/Method Papers
+Before finalizing **Major Revision** for an original AI, LLM, or methodology paper, explicitly run this
+calibration gate. It prevents a valid issue list from under-weighting contribution and priority.
+1. **Design/validity flaw**: Is there a central design, leakage, reference-standard, baseline, or workflow
+   mismatch that threatens the main claim?
+2. **Speculative value**: Is the clinical or research-use pathway weak, with no clear decision-impact,
+   workflow-change, downstream-validation, or actionability argument?
+3. **Weak novelty**: Is the work hard to distinguish from close prior AI/LLM extraction or validation
+   papers, or does it omit the baseline needed to show that the proposed adaptation adds value?
+If 2 and 3 both hold, do not default to Major Revision simply because the review is constructive. In the
+confidential comments, state that the manuscript has a priority/contribution problem in addition to the
+fixable technical issues, and calibrate the recommendation toward the journal's stronger option (for
+example, reject/resubmission where that tier exists). If only 1 holds and the value/novelty case is strong,
+Major Revision remains appropriate.
+### Phase 2A: Systematic Review / Meta-Analysis Extension
+Apply this internal-consistency-first gate (P0) plus 10-probe checklist (P1–P10) **only when manuscript type is "Systematic Review", "Meta-Analysis", or "Systematic Review and Meta-Analysis"**. These probes complement (do not replace) the generic Phase 2 issue checklist.
+**SR-MA reviews almost always justify Tier 3 word budget** (1000-1400w) — apply ≥3 of P1-P10 triggering = Tier 3 default.
+**Probe detail (P0–P10), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/sr_ma.md`. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route conclusion-threatening or integrity findings into the Confidential Comments to the Editor, and place a confirmed error that drives a headline claim as the Major #1 candidate.
+### Phase 2B: Survival / Prognostic Model Extension
+Apply this 8-probe checklist **only when manuscript involves time-to-event outcomes** (OS, DFS, LRFS, DMFS, RFS, PFS, time-to-recurrence) **or prognostic model development** (Cox proportional hazards, DeepSurv, DeepHit, Random Survival Forest, nomogram development/validation, multi-state or multi-outcome survival cascade, risk-stratification with cutoff-based phenotyping).
+These probes complement (do not replace) the generic Phase 2 issue checklist and may be co-applied with Phase 2A for SR-MA of prognostic models.
+**Exempt**:
+- Pure diagnostic accuracy (sensitivity / specificity / AUC, binary classification with no time component)
+- Cross-sectional risk model without time-to-event endpoint
+- Replication of a documented prior methodology
+**Probe detail (S1–S8), with output templates:** `${CLAUDE_SKILL_DIR}/references/domain-probes/survival_prognostic.md`. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route a conditioning/causal-framing, competing-risks, or estimand-provenance (S8) design flaw into the Confidential Comments to the Editor and place it as the Major #1 candidate.
+### Phase 2C: Radiomics / Feature-Reproducibility Extension
+Apply this 4-probe checklist **only when the manuscript maps radiomic feature reliability/reproducibility or feature stability** (test-retest, noise sensitivity, ICC-based reproducibility), runs an **acquisition–reconstruction parameter sweep** (tube voltage, tube current, bin width, reconstruction kernel, slice thickness, iterative reconstruction), or claims that **reliability/robustness/harmonization-based feature filtering** (e.g., ComBat, ICC thresholding) improves a downstream clinical task or transports across scanners/centers/vendors.
+These probes complement (do not replace) the generic Phase 2 issue checklist. Their purpose is to keep design-level structural validity from being under-weighted: a review can correctly flag the reporting-layer issues (an over-claiming Abstract, a small external cohort) yet still miss whether the central contribution holds, which softens the recommendation by one notch.
+**Exempt**:
+- Single fixed-protocol radiomic model with no parameter sweep and no reliability-filtering claim
+- Pure deep-learning end-to-end imaging model (handcrafted feature reproducibility not at issue)
+- Replication of a documented prior radiomic pipeline with no new reliability/transportability claim
+**Probe detail (R1–R4), with output templates:** `${CLAUDE_SKILL_DIR}/references/domain-probes/radiomics.md`. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; a design-grid circularity (R1) or transportability-failure-framed-as-success (R3) finding is design-level, so surface it in the Confidential Comments to the Editor and keep its severity high rather than softening it to a reporting fix.
+### Phase 2D: Narrative / Review-Article Extension
+Apply this 8-probe checklist (RV1–RV8) **only when the manuscript is a Review / narrative review / primer / state-of-the-art / educational review** — i.e., a non-systematic synthesis rather than original research. Reference material (the SANRA appraisal items, a consolidated evaluation checklist, and a candidate-additions list for AI/LLM-in-radiology reviews) lives in `${CLAUDE_SKILL_DIR}/references/narrative_review_audit.md`.
+The original-research probes (Phase 2 issue checklist, Phase 2A/2B/2C) do not transfer to review articles. The key inversion: for original research, reviewers are discouraged from scope-expanding requests, but **for narrative reviews, identifying thematic gaps and proportionately suggesting missing content is an expected part of the reviewer's role** — error-spotting alone is necessary but not sufficient. Keep SANRA in its lane: it is a 6-item *critical appraisal tool, not a reporting guideline*, so do not over-enforce it (only RV3 is SANRA-aligned, and as a suggestion; do not demand PRISMA — narrative ≠ systematic).
+**Exempt**:
+- Original research / development / validation / trial (→ Phase 2 + 2A/2B/2C)
+- Systematic review **with pooling** (meta-analysis) → Phase 2A
+- Case report / editorial / commentary (opinion form; no recommendation gating)
+**Probe detail (RV1–RV8), with the verify-your-own-criticism gate and output templates:** `${CLAUDE_SKILL_DIR}/references/domain-probes/narrative_review.md`. Load it and apply each probe when the trigger above fires; the SANRA appraisal items and candidate-additions catalog in `${CLAUDE_SKILL_DIR}/references/narrative_review_audit.md` remain peer-review-specific supporting material. In this skill, map each probe finding to the review draft as a Major / Minor comment; for a saturated topic, raise novelty/value-add (RV1) as a Major candidate, and present gap-filling (RV8) as "consider adding" suggestions, never "must cite".
+### Phase 2E: Observational / Confounding Extension
+Apply this 6-probe checklist (O1–O6) **only when the manuscript is an observational study** (cohort, case-control, cross-sectional, health-screening / registry) **whose central claim is an adjusted exposure–outcome association** estimated by covariate adjustment rather than randomization. These probes complement (do not replace) the generic Phase 2 issue checklist and the STROBE reporting items; they target the gap between the stated adjustment set and what the exposure-stratified Table 1 shows.
+**Exempt**:
+- Randomized trials (confounding controlled by design → Phase 2 + CONSORT)
+- Purely descriptive / prevalence reports with no adjusted association claim
+- Diagnostic-accuracy studies with no exposure–outcome estimand (→ Phase 2A DTA cells + categories A–C)
+**Probe detail (O1–O6), with output templates:** `${CLAUDE_SKILL_DIR}/references/domain-probes/observational_confounding.md`. Load it and apply each probe when the trigger above fires. O1 (a measured covariate that is imbalanced by exposure in Table 1 yet absent from the adjustment set) is data-checkable and the highest-yield probe — verify it against the manuscript's own Table 1. In this skill, map each probe finding to the review draft as a Major / Minor comment; a confounding-completeness gap (O1), a selection/collider structure that could generate the association (O3), or an undisclosed complete-case collapse (O5) is design-level, so surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate rather than softening it to a reporting fix.
+### Phase 2G: AI / ML Overclaiming Extension
+Apply when an AI/ML **primary study** (diagnostic, prognostic, triage, detection) makes a clinical claim in the Title/Abstract/Conclusion — generalizable, outperforms clinicians, deployment-ready, can replace a reader. Complements Phase 2F (recommendation calibration) and the signature "Overclaiming vs evidence level" check; co-applies with Phase 2C for radiomics-AI and Phase 2B for prognostic-AI.
+**Probe detail (AO0–AO4), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/ai_overclaiming.md`. Load it and apply each probe when the trigger fires. Run AO0 first — locate the load-bearing claim and read it together with its cited evidence before alleging over-reach (a hedged Discussion qualifier is not a headline). In this skill, map each probe finding to the review draft as a Major / Minor comment; a headline generalizability (AO1), superiority/replacement (AO2/AO3), or deployment-readiness (AO4) claim that outruns the design is framing-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate when it is the paper's headline.
+### Phase 3: Draft Review
+Before writing comments, skim the relevant model in `references/exemplar_reviews/` for the
+finding type at hand (AI overclaiming, reference-standard validity, data leakage, missing
+calibration). Each shows the same four moves — anchor the location, state the gap, phrase
+it as a partner (Aczel-compliant), and calibrate severity (design-level → Major #1). Model
+the anchoring and phrasing; do not copy — they are synthetic teaching examples.
+Generate `{manuscript_id}_review_draft.md`:
+```markdown
+# {manuscript_id} — Review Draft
+**Manuscript**: {title}
+**Journal**: {journal}
+**Type**: {Original Research | Review | Technical Note | ...}
+**Recommendation**: {Major Revision | Minor Revision}
+---
+## {Journal-specific scores section, if applicable}
+---
+## CONFIDENTIAL COMMENTS TO THE EDITOR
+{100-150 words: summary + strengths + key concerns + fatal flaw hierarchy if applicable + recommendation}
+**Clinical Impact**: {High/Moderate/Low} — {1 sentence on implications}
+---
+## COMMENTS TO THE AUTHORS
+**Research Summary & General Comments**
+{2-3 sentences summarizing objective, design, key finding (in your own words)}
+Major strengths:
+1. {Specific strength}
+2. {Specific strength}
+3. {Specific strength (optional)}
+{Scope + feasibility: 1-2 sentences — "I have suggestions focused on [areas]. Achievable within existing data."}
+(80-150 words total)
+**Major Comments**
+1) **{Issue title}**
+{Problem 1-2 sentences. Location cited.}
+Suggested revisions:
+- {Fix 1}
+- {Fix 2}
+2) **{Issue title}**
+...
+**Minor Comments**
+1) {One sentence, location cited.}
+2) ...
+**Closing Remark**
+{2-3 sentences, constructive.}
+```
+**Length targets (3-tier, data-grounded)**:
+> **Reference baseline (from peer-comment empirical analysis, n=21 reviewer blocks across 13 decision letters)**: median ≈ 545 words, central 50% range 366-856w, 90th percentile ≈ 870w, only 5% exceed 1000w. Most peer reviewers cluster below 900w.
+- **Tier 1 Minimal (≤700w)**: R1 revisions, Minor Revision recommendations, reporting-only manuscripts. Major 1-3, Minor 3-5.
+- **Tier 2 Standard (700-1000w) ★ default — most reviews should land here**: typical first-round reviews with 1-2 design-level concerns. Major 3-5, Minor 4-6. Sweet spot 800-950w — sits just above the 90th percentile of peer reviewers, expressing design-level rigor without overwhelming editor parsimony.
+- **Tier 3 Extended (1000-1400w)**: justified only when (a) fatal-flaw hierarchy required (≥2 design-level limitations), (b) cross-domain methodology (medical AI × radiology × biostatistics), (c) task-formulation misframing critique, or (d) AI/LLM evaluation requiring model-spec + prompt + selection-bias + framing 4-layer audit. Major 3-5, Minor 5-7. Frequency cap: ≤20% of reviews rolling — if every review trends Tier 3, the niche signal dilutes.
+- **Hard cap 1400 words**. Measure with `awk + wc` (no estimation) — at Phase 3 mid-checkpoint and Phase 6 final.
+- Each Major: 5-8 lines (Tier 1-2) or 8-12 lines (Tier 3, with Why it matters + alternative framings).
+- **Reference-baseline ratio** (self-QC metric): compute `your_wc / 545` and report. Ratio > 2.0 (above 1090w) flags trim candidate. Ratio < 1.0 may indicate insufficient design-level rigor for AI/methodology critique reviews.
+### Phase 4: Self-QC
+After drafting, verify mechanically:
+1. **Numerical accuracy**: All cited numbers (sample size, p-value, AUC) match the manuscript.
+2. **Citation accuracy**: Section/Table/Figure references match manuscript.
+3. **Feasibility**: All suggested revisions achievable with existing data.
+4. **Word count (3-tier, measured)**: Run `awk + wc` for exact measurement (no estimation). Identify which tier the Author section falls in (Tier 1 ≤700w / Tier 2 700-1000w ★ default / Tier 3 1000-1400w). Most reviews should land in Tier 2. If Tier 3, justify with a one-line rationale (which design-level concern warrants the extra length) and verify Tier 3 frequency stays ≤20% rolling. Hard cap 1400w. Also measure at Phase 3 mid-checkpoint, not only at final. Report **reference-baseline ratio** (`wc / 545w`) — ratio > 2.0 flags trim candidate.
+5. **Forbidden words**: No recommendation words (accept/reject/minor/major revision) in Comments to Authors.
+6. **Major #1 = task formulation flaw** (if present): if §3C-1 audit found framing mismatch, place it as Major #1. Do not let it be downgraded into adjacent measurement-level issues (selection bias, sample size).
+7. **AI pattern density (quantified threshold)**: em-dash ≤2 per 1000 words, structural rule-of-three ≤2 per Major comment, significance inflation ("genuinely", "truly", "indeed") 0 per Major, hedged Minor proportion ≥50% ("could", "would help", "I'd suggest" vs bare "Please [verb]").
+8. **Aczel tone audit** (`references/aczel_2021_reviewer2_patterns.md`):
+   - 0 attitude markers (reject/absurd/ridiculous/naive/oblivious/fail)
+   - 0 personal attacks ("the authors seem...", "the authors do not understand")
+   - ≥2 first-person rapport instances in General Comments / Closing Remark
+   - ≥50% of Minor requests use hedged forms ("I'd suggest," "could," "would help") rather than imperative ("must," bare "Please [verb]")
+   - General Comments names ≥2 specific strengths before listing concerns
+   - At most 1 typo/grammar Minor Comment, only if in formal section or systematic
+9. **SR-MA-specific QC** (if Phase 2A applied): Confirm the P0 internal-consistency gate was run before any fabrication claim. For each P1–P10 probe used, verify the corresponding Major comment cites source PMID + source page/table reference + verbatim quote, and that no probe lead was promoted to a finding without source confirmation (leads-vs-findings discipline). Reviews citing extraction errors without source-page reference are not actionable for authors.
+10. **Radiomics-reproducibility QC** (if Phase 2C applied): If an acquisition-parameter sweep predicts an outcome from its own grid axes (R1 design-grid circularity) or the substantive result is a cross-domain failure framed as success (R3), confirm the recommendation reflects design-level severity and is not softened to a reporting fix. Where a model × threshold/cohort grid yields a few p < 0.05, confirm the multiplicity / expected-false-positive count is named (R4), not deferred to "statistical review needed."
+11. **Review-article QC** (if Phase 2D applied): Confirm RV1–RV8 are reflected — in particular that novelty/value-add (RV1) is raised for a saturated topic and that gap-filling (RV8) is present, not just error-spotting. Verify SANRA is used as an appraisal aid, not over-enforced as a reporting guideline (no PRISMA demand on a narrative review; only RV3 is SANRA-aligned and phrased as a suggestion). Verify every suggested addition uses "consider adding" phrasing (no "must cite"), is source-confirmed, and that preprints are labeled as preprints (not equated with peer-reviewed guidelines).
+12. **AI/method priority QC**: Before a Major Revision recommendation, confirm Phase 2F was run. If novelty
+    and clinical/research utility are both weak, the recommendation must reflect that contribution-level
+    concern rather than treating all issues as fixable reporting defects.
+13. **Observational-confounding QC** (if Phase 2E applied): For any covariate imbalanced by exposure in Table 1 but absent from the adjustment set (O1), confirm the comment requests a concrete extended-adjustment sensitivity model, not a vague "adjust for more confounders." Confirm a selection/collider structure (O3) or an undisclosed complete-case collapse from a structural-zero dose covariate (O5) is raised at design-level severity, and that any E-value request (O6) targets the declared primary estimate rather than a supporting one.
+14. **Verify-your-own-criticism** (all reviews): For each Major framed as a technical inaccuracy or a citation–claim mismatch, confirm the reviewer's own assertion was checked against a current authoritative source (full paper, CrossRef, arXiv). Downgrade unverified technical claims to a hedged "Please verify…"; keep confirmed ones firm. Watch for status drift (a "preprint" since published; a method since adapted) before asserting the manuscript is wrong.
+Fix all issues found, then present to user.
+### Phase 5: Refinement
+1. Present the draft to the user for review.
+2. Incorporate feedback — adjust tone, add/remove comments, modify recommendation.
+3. Generate `{manuscript_id}_review_final.md` — the polished version.
+4. Generate `{manuscript_id}_submission.md` — formatted for copy-paste into editorial system:
+   - Strip markdown formatting for plain-text boxes
+   - Separate "Comments to Author" and "Confidential Comments to Editor"
+   - Include journal-specific score table if applicable
+### Phase 6: Pre-Submission QC
+- [ ] No recommendation words in Comments to Authors
+- [ ] All cited numbers match the manuscript
+- [ ] Major comments ranked by impact (Task formulation flaw, if present, as Major #1)
+- [ ] All suggestions feasible with existing data
+- [ ] Author section word count measured (awk + wc), tier identified (Tier 1 ≤700w / Tier 2 700-1000w ★ default / Tier 3 1000-1400w); Tier 3 justified + ≤20% rolling frequency
+- [ ] Reference-baseline ratio (`wc / 545w`) reported; ratio > 2.0 trimmed
+- [ ] Hard cap 1400 words not exceeded
+- [ ] AI pattern density within thresholds (em-dash ≤2/1000w; structural rule-of-three ≤2/Major; significance inflation 0/Major; hedged Minor ≥50%)
+- [ ] Fatal flaw hierarchy stated in Confidential Comments (if applicable)
+- [ ] Reject recommendations (if used): §1C condition checklist (design-level flaw + speculative practical value 3-trigger + novelty gap) explicitly verified — at least 2 of 3 conditions met
+- [ ] AI/method Major Revision recommendations: contribution/value gate checked; weak novelty + weak utility not silently softened
+## Tone and Calibration
+- **Default**: Developmental, constructive, partner-voice (not gatekeeper-voice)
+- **Aczel 2021 patterns** (`references/aczel_2021_reviewer2_patterns.md`): avoid attitude markers ("reject," "absurd," "oblivious"), boosters, personal attacks on authors, vague dismissals, and typo nitpicking; prefer first-person rapport ("I appreciate," "I stumbled over"), hedged suggestions ("I'd suggest," "could," "would help"), and critique aimed at the work rather than the people. Apply throughout drafting, not just QC.
+- **Escalate tone** only when: clinical validity threatened, patient safety concern, severe data leakage, or reference standard fundamentally flawed
+- **Default recommendation**: Major Revision (unless issues are purely reporting/clarity → Minor Revision)
+- **Fatal flaw signal**: State in Confidential Comments which issue(s) represent fundamental design limitations, rather than recommending Reject directly
+- **Contribution/priority override**: For original AI or method papers, a manuscript can be technically
+  analyzable and still below the journal's priority bar. When weak novelty and weak clinical/research
+  utility both hold, surface that in Confidential Comments and calibrate the recommendation upward from the
+  default Major Revision tier.
+- **Length proportionality**: Minor Revision ≤ 600 words; Major Revision ≤ 1000 words. Length signals difficulty — a Minor Revision review longer than the manuscript itself reads as Reviewer 2.
+## Signature Review Patterns
+Recurring high-yield checks — apply to every manuscript:
+1. **Patient-level data splitting**: Splitting at patient level, not image/exam level
+2. **Confidence intervals**: All primary metrics should have 95% CIs
+3. **Intended use statement**: Clinical workflow position and decision influenced should be clear
+4. **Calibration**: AUC alone insufficient for prediction models — calibration metrics needed
+5. **Overclaiming**: Language should match evidence level (CI overlap, small test sets, single-center)
+6. **Reproducibility**: Preprocessing, hyperparameters, segmentation protocols reported
+For survival / prognostic-model manuscripts, also apply the Phase 2B 8-probe audit (conditioning, censoring, competing risks, cutoff optimism, comparator horizon alignment, C-index variant transparency, calibration beyond discrimination, estimand provenance).
+For radiomic feature-reproducibility / phantom parameter-sweep / reliability-filtering manuscripts, also apply the Phase 2C 4-probe audit (design-grid circularity, construct validity / proxy-target gap, transportability framing with Reject-escalate calibration, multiplicity).
+For Review / narrative / primer / state-of-the-art manuscripts, apply the Phase 2D 8-probe audit (novelty/value-add, scope/aims, evidence-gathering transparency, technical/medical accuracy, taxonomy/synthesis coherence, balance/currency/citation accuracy, load-bearing figures/tables, constructive gap-filling) in place of the original-research probes — error-spotting plus proportionate gap-filling, with SANRA used as an appraisal aid only.
+For observational studies whose central claim is an adjusted exposure–outcome association, also apply the Phase 2E 6-probe audit (confounding completeness, adjustment-set provenance, selection/collider bias, exposure measurement validity, missing-data / complete-case collapse, residual-confounding E-value), with O1 — a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set — checked against the manuscript's own Table 1.
+## Journal-Specific Formatting
+**Canonical source:** per-journal profile files at
+`references/reviewer_profiles/{JOURNAL_SHORTNAME}.md`
+In Phase 1 (Setup), after identifying the journal, read the matching profile and render its scorecard template at the top of the draft in Phase 3, above Confidential Comments to the Editor. This avoids duplicating journal form fields across multiple skills.
+Current profiles:
+| Short | Journal | System | Scorecard |
+|---|---|---|---|
+| KJR | Korean Journal of Radiology | ScholarOne | 8 items, Excellent→Poor |
+| RYAI | Radiology: Artificial Intelligence | ScholarOne | 5 items, 1–9 |
+| INSI | Insights into Imaging | Editorial Manager | 4 items, H/M/L |
+| AJR | American Journal of Roentgenology | Editorial Manager | Section-by-section |
+| EURE | European Radiology | Editorial Manager | INSI-style base |
+### Custom Journal
+If a journal has no profile yet, use the generic format from Phase 3 and ask the user for the invitation form's scorecard fields so a new profile can be added under `reviewer_profiles/`.
+## Output Contract
+| Artifact | Filename | Format |
+|----------|----------|--------|
+| Review draft | `{manuscript_id}_review_draft.md` | Markdown |
+| Final review | `{manuscript_id}_review_final.md` | Markdown |
+| Submission text | `{manuscript_id}_submission.md` | Plain text |
+## Skill Interactions
+| Need | Skill | When |
+|------|-------|------|
+| Reporting compliance | `/check-reporting` | Phase 2 — guideline check |
+| AI pattern detection | `/humanize` | If reviewing for AI writing patterns |
+## What This Skill Does NOT Do
+- Does not write the user's own manuscripts → use `/write-paper`
+- Does not perform self-review of own work → use `/self-review`
+- Does not submit the review to the journal system
+- Does not access journal editorial systems directly
+## Anti-Hallucination
+- **Never fabricate manuscript content.** All cited numbers, methods, and findings must come from the actual manuscript.
+- **Never invent journal scoring criteria.** If uncertain about a journal's format, ask the user or use the generic format.
+- **Never generate references from memory.** Use `/search-lit` if citations are needed for reviewer comments.
+- If a reporting guideline item is uncertain, flag it as `[CHECK]` rather than asserting compliance.

package/skills/peer-review/references/aczel_2021_reviewer2_patterns.md ADDED Viewed

@@ -0,0 +1,88 @@
+# Anti–"Reviewer 2" Tone Patterns
+Source: Aczel B, Szaszi B, Holcombe AO. *Don't be reviewer 2! Reflections on writing effective peer review comments.* Research Integrity and Peer Review. 2021;6:13. PMC8505560.
+This reference codifies the linguistic patterns to avoid (and the partner-voice patterns to use) when drafting peer reviews. Apply during Phase 3 drafting and Phase 4 Self-QC.
+## Avoid (the "Reviewer 2" signals)
+| Category | Examples | Why |
+|---|---|---|
+| **Attitude markers** | "reject," "absurdly," "illogical," "naive" | Reads as verdict, not feedback |
+| **Boosters** | "utterly ridiculous," "completely fails to," "totally inadequate" | Amplifies negativity, no information added |
+| **Self-mention as gatekeeper** | "I cannot possibly imagine," "I refuse to believe" | Centers reviewer ego over the work |
+| **Personal attacks on authors** | "The authors seem oblivious to...," "The authors do not understand..." | Critiques the people, not the work |
+| **Vague dismissals** | "There is a vast literature the authors have ignored," "This is well-known" | Not actionable; offers no path forward |
+| **Third-person accusatory framing** | "The authors do not sufficiently explain..." | Distancing register feels like a verdict |
+| **Nitpicking every typo** | Listing 10 grammar errors as separate Minor items | Signals reviewer hostility, not care for the manuscript |
+| **Requesting non-existent studies** | "The authors should have done a multi-center RCT" | Asks for impossible work to justify rejection |
+| **Self-citation pressure** | "The authors should cite [reviewer's papers]" unless directly relevant | Ego-driven, recognizable to editors |
+| **Over-length reviews** | Multi-hour reviews with 20+ comments | Signals desire to overwhelm rather than help |
+## Prefer (the partner-voice signals)
+| Category | Examples | Why |
+|---|---|---|
+| **First/second person rapport** | "I appreciate the thoroughness," "I stumbled over some jargon," "I look forward to the next version" | Builds collaboration, not hierarchy |
+| **Hedged suggestions** | "I'd suggest," "It would help if," "Consider whether," "could be clarified" | Leaves authorial judgment intact |
+| **Critique the work, not the people** | "The paper's claim to originality is weakened by..." (not "The authors seem oblivious to...") | Protects the relationship while making the same point |
+| **Specific, actionable feedback** | Name the page/table/line/cell. Specify the missing citation. State the exact change requested. | Authors can act in one editorial pass |
+| **Balanced framing** | Acknowledge the work's strengths in General Comments before listing concerns | Reviewer credibility ↑; authors more receptive |
+| **Distinguish reflection from request** | Numbered Minor items = actionable; Closing Remark = reflection. Do not blur. | Authors know what to do |
+| **Calibrate length to severity** | Minor Revision → 1-2 sentences per comment. Major Revision → 3-5 sentences with suggested fixes. | Length signals difficulty; mismatched length confuses authors |
+## Worked transformations (Aczel verbatim examples + skill-specific)
+### From the paper
+| Problematic | Improved |
+|---|---|
+| "The authors seem oblivious to the extensive existing literature on this subject in the field of higher education, and thus claim their discoveries as original when they are not" | "The paper's claim to originality is weakened by its lack of reference to similar work done in the field of higher education" |
+| "The authors do not sufficiently explain themselves in the Methods section, which is jargon-filled" | "I stumbled over some of the jargon in your Methods section; I'd suggest that you adopt more plain-language explanations" |
+### Skill-specific transformations (medical imaging context)
+| Problematic | Improved |
+|---|---|
+| "Please harmonize this label." | "I'd suggest harmonizing this label across the table." |
+| "The authors must clarify..." | "It would help future readers if [X] were clarified." |
+| "The Conclusion is overstated." | "The Conclusion could be tempered to match the single-unit scope of the data." |
+| "Please cite the formula." | "Citing the specific formula (e.g., Graybill 1976) would make the calculation unambiguously reproducible." |
+| "The methods section is unclear." | "I found myself wondering which environment was used for which analysis — a one-line breakdown would aid reproducibility." |
+## Phase 4 Self-QC additions
+Add to the existing Pre-Submission QC checklist:
+- [ ] **Attitude marker scan**: 0 instances of "reject," "absurd," "ridiculous," "naive," "oblivious," "fail," "wrong"
+- [ ] **Personal attack scan**: No "the authors seem...," "the authors do not understand," "the authors are unaware"
+- [ ] **First-person rapport present**: At least 2 instances of "I" in General Comments / Closing Remark (not in attitude-marker contexts)
+- [ ] **Hedged language ratio**: At least 50% of Minor Comment requests use hedged forms ("I'd suggest," "could," "would help") rather than imperative ("must," "Please [verb]")
+- [ ] **Balance check**: General Comments names ≥2 specific strengths before listing concerns
+- [ ] **Length proportionality**: Minor Revision ≤ 600 words total; Major Revision ≤ 1000 words total
+- [ ] **Typo nitpicking limit**: At most 1 grammar/typo Minor Comment, only if in formal section (Acknowledgements, Declarations) or repeated systematically
+## When to escalate tone (override partner voice)
+Aczel's framework assumes the manuscript is fundamentally sound but needs revision. Escalate to firmer (still professional) language only when:
+- Patient safety concern (dose, drug error, PHI leak)
+- Severe data leakage (training-test contamination, label leakage)
+- Reference standard fundamentally invalid (no ground truth)
+- Citation fabrication or plagiarism suspected
+- Author conflict of interest undeclared
+Even at escalation, retain:
+- Specific evidence (page/line)
+- Hedged accusation form ("the data appear to suggest...")
+- Confidential Comments to Editor for the gravest concerns
+## Cross-references
+- Skill: `peer-review/SKILL.md` Phase 3 (Draft) and Phase 4 (Self-QC)
+- Companion: `peer-review/references/reviewer_profiles/{JOURNAL}.md` for journal-specific scorecard
+- Related: `~/.claude/rules/writing-style.md` (active voice, no "we believe/think")
+## Citation
+Aczel B, Szaszi B, Holcombe AO. Don't be reviewer 2! Reflections on writing effective peer review comments. *Research Integrity and Peer Review.* 2021 Oct 11;6(1):13. doi:10.1186/s41073-021-00117-3. PMCID: PMC8505560.

package/skills/peer-review/references/domain-probes/ai_overclaiming.md ADDED Viewed

@@ -0,0 +1,47 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a task- or
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# AI / ML overclaiming probes (AO0–AO4)
+A 4-probe checklist for medical-AI/ML primary studies (diagnostic, prognostic, triage, detection) where the **conclusion's reach exceeds the evidence**. These probes complement (do not replace) the generic Phase 2 issue checklist and the signature "Overclaiming vs evidence level" check. The aim is to keep a framing-level over-reach from passing as a wording nitpick: a paper can report sound metrics yet draw a clinical claim — generalizable, outperforms clinicians, deployment-ready — that the design does not support, and that claim is what a reader carries away. Run AO0 first.
+**AO0 — Locate the strongest claim, then its support (run before AO1; gates any over-reach finding)**:
+- Identify the load-bearing claims in the Title, Abstract, and Conclusion (the sentences a reader quotes). For each, find the specific evidence cited (which dataset, which comparison, which metric + uncertainty).
+- An over-reach finding is a **lead until the claim and its support are read together against the manuscript** — do not strawman a stray adjective. Escalate only when a headline claim genuinely outruns the cited evidence.
+- If the claim is already appropriately hedged to the evidence, record "claim matched to evidence" and move on.
+**AO1 — Generalizability claimed from limited external validation**:
+- Does the Abstract/Conclusion assert the model "generalizes," is "transferable/robust across settings," or is suitable for broad populations, while external validation is a single site / single scanner-vendor / single source (or absent)?
+- Sub-check: is the external set demographically narrow (single ethnicity, single sex-dominant, narrow age) relative to the population the claim names?
+- If the generalizability claim outruns the external evidence → recommend softening to the evidence ("validated at one external site") and moving multi-setting generalizability to a stated limitation + next step. MAJOR candidate when it is a headline claim; MINOR when it is a single qualifier in the Discussion.
+**AO2 — Superiority language against overlapping or under-powered comparison**:
+- Flag "outperforms", "superior to", "beats", "can replace [clinician/radiologist]" when (a) the model vs comparator 95% CIs overlap, (b) no test of the *difference* is reported (two separate AUCs are not a comparison), or (c) the comparison rests on a small test set / few readers.
+- Ask for the difference in the metric with its CI and a paired test of that difference, not two standalone estimates.
+- If the difference is not statistically supported → recommend reframing from "outperforms" to "comparable to" (still a meaningful result). MAJOR when a superiority/replacement claim is the headline; otherwise MINOR.
+**AO3 — Comparison-frame mismatch (model task ≠ human task)**:
+- When a model-vs-clinician comparison drives a claim, verify the two performed the **same task on the same inputs under the same constraints**: same images/inputs available, same time budget, same question asked, same decision point.
+- Common mismatches: the model sees a curated single view while readers see the full study; readers are timed or work from a different modality; the "reader" benchmark is a literature value on a different cohort.
+- A mismatch makes "outperforms clinicians" non-interpretable as a clinical claim → ask the authors to state exactly which task the comparison establishes, or to align the conditions. MAJOR candidate when it underpins a headline.
+**AO4 — Deployment / clinical-readiness claim from retrospective internal evidence**:
+- Flag "ready for clinical deployment", "can be used to triage/guide treatment", "will reduce workload/cost", or a recommended decision threshold, when the evidence is a retrospective, internally-split (or even external but observational) accuracy study with no prospective, silent-trial, or decision-impact data and (often) no calibration or decision-curve analysis.
+- Discrimination on retrospective data does not establish that acting on the model helps patients; a probability that drives a decision must also be calibrated, and net benefit must be shown.
+- Recommend reframing deployment/utility language to "supports further prospective evaluation", and (where a threshold is proposed) adding calibration + decision-curve evidence. MAJOR when a deployment/care-directive claim is made; MINOR when only a hedged "potential utility" sentence.
+**Output template (AO1 example)**:
+> "The Conclusion states the model 'generalizes across institutions,' but external validation appears limited to a single site ([Methods, External validation]). I'd suggest softening this to the evidence — e.g., 'validated at one external site' — and framing multi-institution generalizability as a stated limitation and a next step. If a broader claim is intended, an external set spanning multiple sites/vendors would be needed to support it."
+**Output template (AO2 / AO3 example)**:
+> "The 'outperforms radiologists' claim rests on a comparison whose 95% CIs for model and reader [metric] overlap ([Figure/Table]), and no test of the difference is reported; the reader task also differs from the model's in [inputs/time] ([Methods/Table]). I'd suggest (a) reporting the difference in [metric] with its CI and a paired test rather than two separate estimates, and (b) stating explicitly which clinical task the comparison establishes. If the difference is not statistically supported, reframing from 'outperforms' to 'comparable to' would be both defensible and still a meaningful result."
+**Discipline — leads vs findings (applies to AO0–AO4)**:
+- A claim-vs-evidence mismatch surfaced by a quick scan is a **lead, not a finding, until the claim sentence and its cited support are read together** against the manuscript. Do not escalate a hedged Discussion qualifier as if it were a headline.
+- Anchor every over-reach comment to the exact claim location and the exact evidence (dataset, comparison, metric + CI). A comment that names the location and the gap is actionable; "the authors overclaim" is not.
+- Keep severity tied to *where* the claim sits and *what it drives*: a headline/clinical-action claim that outruns the design is design-/framing-level (MAJOR, often Major #1); a stray adjective is MINOR.

package/skills/peer-review/references/domain-probes/narrative_review.md ADDED Viewed

@@ -0,0 +1,44 @@
+<!-- Domain probe module — shared, vendored BYTE-IDENTICAL by /peer-review and /self-review.
+     Severity words below (MAJOR / MINOR / major / minor) denote finding severity, NOT a journal
+     recommendation. Each consuming skill maps findings to its own output:
+       - peer-review: Major / Minor comments + Confidential Comments to the Editor; a task- or
+         design-level flaw is placed as Major #1.
+       - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
+     Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
+# Narrative / Review-Article probes (RV1–RV8)
+An 8-probe checklist for a Review / narrative review / primer / state-of-the-art / educational review — i.e., a non-systematic synthesis rather than original research. Supporting appraisal material (the SANRA appraisal items, a consolidated evaluation checklist, and a candidate-additions catalog for AI/LLM-in-radiology reviews) is maintained separately by the peer-review skill and is not required to apply RV1–RV8 below.
+The original-research probes (the generic Phase 2 issue checklist, and the SR-MA / Survival / Radiomics probes) do not transfer to review articles. The key inversion: for original research, reviewers are discouraged from scope-expanding requests, but **for narrative reviews, identifying thematic gaps and proportionately suggesting missing content is an expected part of the reviewer's role** — error-spotting alone is necessary but not sufficient. Keep SANRA in its lane: it is a 6-item *critical appraisal tool, not a reporting guideline*, so do not over-enforce it (only RV3 is SANRA-aligned, and as a suggestion; do not demand PRISMA — narrative ≠ systematic).
+**RV1 — Novelty & value-add** *(editorial value-add axis)*: Against ≥2–3 recent reviews/primers on the same topic, does the manuscript state explicitly what it adds? For saturated topics, if the authors do not position their contribution against the current review literature, the incremental value is hard to judge — MAJOR candidate. Judge contribution magnitude only; scope-fit is the editor's call.
+**RV2 — Scope & aims clarity** (SANRA items 1–2): Is the topic's importance established, and are the review's aims and scope boundaries (what is included/excluded) explicit?
+**RV3 — Evidence-gathering transparency** *(SANRA item 3, suggestion-level)*: Even a narrative review benefits from one paragraph on how the literature was identified (databases, time window, selection logic). This is **not a reject criterion** — phrase it as a SANRA-aligned transparency suggestion. Do not require PRISMA.
+**RV4 — Technical & medical accuracy** *(reviewer niche strength)*: Engineering correctness (autoregressive decoding, RAG, RLHF, instruction tuning, hallucination mechanisms, evaluation/mitigation methods) and medical correctness (radiology claims, clinical examples, anatomy/imaging detail). Itemize errors with location. This axis is where a domain-literate reviewer adds unique value.
+> **Verify-your-own-criticism gate**: before raising a technical inaccuracy or a citation–claim mismatch as a major finding, cross-check the assertion against a current authoritative source (the full cited paper, CrossRef, arXiv). Fast-moving fields make critiques go stale: a method dismissed as "not applicable" may have been adapted, and a "preprint" may since have been peer-reviewed. If unverified, downgrade to a hedged "Please verify…"; if confirmed, state it firmly. This applies with extra force to claims about what a cited reference *argues* (a review about hallucination must not itself mis-attribute a source).
+**RV5 — Taxonomy / synthesis coherence**: Is the manuscript's classification mutually exclusive and collectively exhaustive, and does it map to established taxonomies (intrinsic vs extrinsic; faithfulness vs factuality; published hallucination surveys)? Ad-hoc categories should be reconciled with an established taxonomy. Is the synthesis integrative rather than a list?
+**RV6 — Balance, currency, citation accuracy** (SANRA items 4–5): Is conflicting evidence handled fairly (no cherry-picking)? Are citations current and primary-source-weighted? Spot-check citation accuracy (author/year/claim match) — for a review *about* hallucination, citation errors are thematically critical.
+**RV7 — Load-bearing figures/tables** *(editorial value-add axis; SANRA item 6 secondary)*: Are there standardized comparison tables, a landscape figure, or a concrete clinical worked example? Assess whether figures/tables carry synthesis weight or are decorative — strong radiology-AI reviews tend to use standardized comparison matrices and a worked example.
+**RV8 — Constructive gap-filling & additions** *(the expected-role probe)*: Identify missing topics/frameworks/key references and propose them as **"consider adding X because it directly supports Y"** — never "must cite." Tier candidates by publication status:
+- *Peer-reviewed guidelines*: TRIPOD-LLM, MI-CLAIM-GEN, and STARD-AI (all Nature Medicine), and the CLAIM 2024 update (Radiology: AI)
+- *Preprint (label as such)*: any not-yet-peer-reviewed arXiv/medRxiv item — name it as a preprint and do not place it at the same level as peer-reviewed guidelines. Verify status before citing, since preprints are frequently published later (a checklist first posted to arXiv may since have appeared in a journal)
+- *Concepts/tools*: RAG specifics (retrieval failure vs fabrication), uncertainty/confidence calibration, radiology-specific evaluation (RadGraph, CheXbert/CheXpert-F1, ReXTrust), regulatory context (FDA 510(k)/CE, RADAR)
+Keep additions **proportionate** (≈ ≤1 new reference per page, each motivated; no wholesale rewrite). Suggesting missing *literature/topics* is expected; demanding new *studies* is not.
+**Output template (RV1 example)**:
+> "The topic of LLM hallucinations is now addressed by several recent reviews, so it would strengthen the manuscript to state explicitly what this primer adds beyond them — for example, a radiology-specific failure taxonomy, a worked clinical example, or an actionable verification workflow that existing general-purpose reviews do not provide. As written, the Introduction does not position the contribution against the current review literature, which makes the incremental value difficult to judge."
+**Output template (RV8 example)**:
+> "The mitigation section would benefit from engaging with emerging reporting standards for generative models, as these directly support the manuscript's call for controlled deployment. Consider adding a brief discussion of TRIPOD-LLM and MI-CLAIM-GEN (both peer-reviewed reporting guidelines for LLM/generative studies), and clarifying how retrieval-augmented generation shifts the dominant failure mode from fabrication toward retrieval error rather than eliminating hallucination, a distinction the current text conflates."
+This module gives review/narrative manuscripts a dedicated audit gate, on the principle that constructive gap-filling is an expected part of appraising a review article.