medsci-skills 4.7.0 → 4.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (44)
  1. package/README.md +7 -0
  2. package/metadata/distribution_files.json +51 -36
  3. package/metadata/distribution_manifest.json +1 -1
  4. package/package.json +1 -1
  5. package/skills/analyze-stats/scripts/check_generated_code.py +7 -1
  6. package/skills/analyze-stats/scripts/rating_monotonicity.py +132 -0
  7. package/skills/analyze-stats/tests/fixtures/gen_palette.py +22 -0
  8. package/skills/analyze-stats/tests/test_generated_code.sh +12 -0
  9. package/skills/design-study/SKILL.md +33 -0
  10. package/skills/peer-review/SKILL.md +9 -9
  11. package/skills/peer-review/references/domain-probes/ai_overclaiming.md +7 -1
  12. package/skills/peer-review/references/domain-probes/diagnostic_accuracy.md +7 -2
  13. package/skills/peer-review/references/domain-probes/observational_confounding.md +11 -1
  14. package/skills/peer-review/references/domain-probes/sr_ma.md +21 -1
  15. package/skills/peer-review/references/domain-probes/survival_prognostic.md +1 -0
  16. package/skills/self-review/SKILL.md +77 -5
  17. package/skills/self-review/references/domain-probes/ai_overclaiming.md +7 -1
  18. package/skills/self-review/references/domain-probes/diagnostic_accuracy.md +7 -2
  19. package/skills/self-review/references/domain-probes/observational_confounding.md +11 -1
  20. package/skills/self-review/references/domain-probes/sr_ma.md +21 -1
  21. package/skills/self-review/references/domain-probes/survival_prognostic.md +1 -0
  22. package/skills/self-review/scripts/check_artifact_coverage.py +91 -2
  23. package/skills/self-review/scripts/check_classical_style.py +20 -6
  24. package/skills/self-review/scripts/check_cohort_arithmetic.py +8 -1
  25. package/skills/self-review/scripts/check_null_calibration.py +175 -0
  26. package/skills/self-review/scripts/check_scope_coherence.py +22 -2
  27. package/skills/self-review/scripts/check_supplement_hygiene.py +255 -0
  28. package/skills/self-review/tests/fixtures/cohort_rate_tier_fp.md +4 -0
  29. package/skills/self-review/tests/fixtures/coverage_promised_stat.md +9 -0
  30. package/skills/self-review/tests/fixtures/coverage_promised_stat_ok.md +8 -0
  31. package/skills/self-review/tests/fixtures/null_bad.md +11 -0
  32. package/skills/self-review/tests/fixtures/null_clean.md +13 -0
  33. package/skills/self-review/tests/fixtures/scope_disclaimer.md +9 -0
  34. package/skills/self-review/tests/fixtures/style_dagger_footnote.md +11 -0
  35. package/skills/self-review/tests/fixtures/supp_clean.md +10 -0
  36. package/skills/self-review/tests/fixtures/supp_dirty.md +14 -0
  37. package/skills/self-review/tests/fixtures/supp_xref_body.md +4 -0
  38. package/skills/self-review/tests/fixtures/supp_xref_supp.md +5 -0
  39. package/skills/self-review/tests/test_artifact_coverage.sh +18 -0
  40. package/skills/self-review/tests/test_classical_style.sh +12 -0
  41. package/skills/self-review/tests/test_cohort_arithmetic.sh +13 -0
  42. package/skills/self-review/tests/test_null_calibration.sh +47 -0
  43. package/skills/self-review/tests/test_scope_coherence.sh +13 -0
  44. package/skills/self-review/tests/test_supplement_hygiene.sh +63 -0
package/README.md CHANGED
@@ -11,6 +11,7 @@
11
11
  [![CI](https://img.shields.io/github/actions/workflow/status/Aperivue/medsci-skills/validate.yml?branch=main&style=flat-square&label=CI)](https://github.com/Aperivue/medsci-skills/actions/workflows/validate.yml)
12
12
  ![Skills](https://img.shields.io/badge/Skills-45-brightgreen?style=flat-square)
13
13
  [![npm](https://img.shields.io/npm/v/medsci-skills?style=flat-square&label=npm&color=cb3837)](https://www.npmjs.com/package/medsci-skills)
14
+ [![Watch the 2-min intro](https://img.shields.io/badge/▶_Watch-2--min_intro-FF0000?style=flat-square&logo=youtube&logoColor=white)](https://youtu.be/MclQ_RIofpE)
14
15
  [![good first issues](https://img.shields.io/github/issues/Aperivue/medsci-skills/good%20first%20issue?style=flat-square&label=good%20first%20issues&color=7057ff)](https://github.com/Aperivue/medsci-skills/contribute)
15
16
 
16
17
  [![Agent Skills](https://img.shields.io/badge/Agent_Skills-standard-blue?style=flat-square)](https://agentskills.io)
@@ -267,6 +268,12 @@ The E2E pipeline (`orchestrate --e2e`) produces everything up to `qc/`. The `sub
267
268
 
268
269
  ## What's New
269
270
 
271
+ **v4.8** is the **review-harvest batch** — deterministic detector hardening promoted from real-manuscript review cycles. Additive and backward-compatible; still 45 skills / 36 guidelines, analysis-integrity detectors **30 → 32**:
272
+
273
+ - **Two new gates** — `check_supplement_hygiene.py` lints the rendered supplement / tables / caption files (not just the manuscript) for §-labels, placeholders, build markers, response-letter framing, and unresolved body↔supplement cross-references; `check_null_calibration.py` flags a headline negative/equivalence claim made without a minimum-detectable-effect / power / equivalence statement.
274
+ - **Four detector false-positive fixes** — gates no longer fire on a recommended colorblind-safe palette, author-footnote `§` daggers, a correctly-hedged disclaimer, or a tier-label digit; each with a regression fixture and three newly CI-wired test suites.
275
+ - **Nine reviewer-side domain probes** (SR/MA, observational, diagnostic, AI-overclaiming, survival) plus a `/design-study` design-stage ceiling gate for perceptual/reader-AI studies and a reusable confidence-weighted-rating→AUC monotonicity template.
276
+
270
277
  **v4.7** is the **self-update foundation** — physician-researchers stay current without GitHub, git, or a terminal. Additive and backward-compatible; still 45 skills / 36 guidelines / 30 detectors:
271
278
 
272
279
  - **Transactional, crash-recoverable installer.** Each install runs through a durable journal state machine recovered on the next run (roll back / forward-clean / fail-closed), with per-target SHA-256 inventories — your modified or third-party skills are backed up and never clobbered or auto-deleted.
package/metadata/distribution_files.json CHANGED
@@ -358,8 +358,13 @@
358
358
  },
359
359
  {
360
360
  "path": "skills/analyze-stats/scripts/check_generated_code.py",
361
- "size": 14525,
362
- "sha256": "b9918728e4eac298fdb6adffa2a79faf835b395b65bace69409c190e88c13619"
361
+ "size": 15024,
362
+ "sha256": "b8643d9d9a4b600af428604811135fc27cee6eb2140bdae5e393ed9b3be37fae"
363
+ },
364
+ {
365
+ "path": "skills/analyze-stats/scripts/rating_monotonicity.py",
366
+ "size": 5494,
367
+ "sha256": "b7f8f36c7af060e2dec60f51185ef4cc5a0f2ef20c1d509fb845b36521f420b3"
363
368
  },
364
369
  {
365
370
  "path": "skills/analyze-stats/skill.yml",
@@ -868,8 +873,8 @@
868
873
  },
869
874
  {
870
875
  "path": "skills/design-study/SKILL.md",
871
- "size": 15849,
872
- "sha256": "41e461909bbf56f0cb810c6d0fe809380b70125911bbea014d9d7b8b397a13e9"
876
+ "size": 18282,
877
+ "sha256": "dfd73871102af6beab3d21f60f650db41ec2e4025d6ee19eae8de61b17851fd4"
873
878
  },
874
879
  {
875
880
  "path": "skills/design-study/skill.yml",
@@ -2363,8 +2368,8 @@
2363
2368
  },
2364
2369
  {
2365
2370
  "path": "skills/peer-review/SKILL.md",
2366
- "size": 50226,
2367
- "sha256": "48e0841a8fc2364c54964ae2346c13fb14978c87478ae1ed37964f728e65f02d"
2371
+ "size": 50304,
2372
+ "sha256": "2cabbeb0695e52e6a4aa1f341a0ddeb3cfe088c53e25c9803a2b1b5f3a0b5bd1"
2368
2373
  },
2369
2374
  {
2370
2375
  "path": "skills/peer-review/references/aczel_2021_reviewer2_patterns.md",
@@ -2373,8 +2378,8 @@
2373
2378
  },
2374
2379
  {
2375
2380
  "path": "skills/peer-review/references/domain-probes/ai_overclaiming.md",
2376
- "size": 12790,
2377
- "sha256": "cda5e4c56c37c37e036565f94d05487b8aa2d1c90b2a6a5b2e3113171456b6d7"
2381
+ "size": 14205,
2382
+ "sha256": "565aa362e20ea6ca923510ea393829fb851e0caaae3d732f99144dd48a25b951"
2378
2383
  },
2379
2384
  {
2380
2385
  "path": "skills/peer-review/references/domain-probes/case_report.md",
@@ -2388,8 +2393,8 @@
2388
2393
  },
2389
2394
  {
2390
2395
  "path": "skills/peer-review/references/domain-probes/diagnostic_accuracy.md",
2391
- "size": 7673,
2392
- "sha256": "292a3cefe72e55856084dc2d0ba7cd46ec60d8f52fd488ea57b72534f7ee39f7"
2396
+ "size": 9002,
2397
+ "sha256": "8232e2023d3c4d52e6a2d9003ae335302d6fe78c54d39b81f591c07becaf4df0"
2393
2398
  },
2394
2399
  {
2395
2400
  "path": "skills/peer-review/references/domain-probes/equity_fairness.md",
@@ -2408,8 +2413,8 @@
2408
2413
  },
2409
2414
  {
2410
2415
  "path": "skills/peer-review/references/domain-probes/observational_confounding.md",
2411
- "size": 24897,
2412
- "sha256": "8c21c2b0cde6d46dfdc9757571aeb480e0a692d30aea19d56b06654bbd3ba803"
2416
+ "size": 27378,
2417
+ "sha256": "bbced9817545c65992e7c3430057e32b9f32e092180bc91bd08730f03b2dcf9a"
2413
2418
  },
2414
2419
  {
2415
2420
  "path": "skills/peer-review/references/domain-probes/radiomics.md",
@@ -2423,13 +2428,13 @@
2423
2428
  },
2424
2429
  {
2425
2430
  "path": "skills/peer-review/references/domain-probes/sr_ma.md",
2426
- "size": 13933,
2427
- "sha256": "bcd74a02e89aec5c0436297b93181cd256f7f3a0d676843c214e330fa8d3b036"
2431
+ "size": 17239,
2432
+ "sha256": "486d569f559f16d62b882f7256ccfe7443e3e2b4cc72a407a5cd046162a98b25"
2428
2433
  },
2429
2434
  {
2430
2435
  "path": "skills/peer-review/references/domain-probes/survival_prognostic.md",
2431
- "size": 13073,
2432
- "sha256": "0a5922aa7ca389089de394b3fa569565600384ceabae0bd25bfd50efa901cd13"
2436
+ "size": 13765,
2437
+ "sha256": "af0e0323876424eec8ba07be22fe99b3100356e8e4895b5b79635e9a2d76e98e"
2433
2438
  },
2434
2439
  {
2435
2440
  "path": "skills/peer-review/references/exemplar_reviews/README.md",
@@ -2843,13 +2848,13 @@
2843
2848
  },
2844
2849
  {
2845
2850
  "path": "skills/self-review/SKILL.md",
2846
- "size": 84876,
2847
- "sha256": "e4d5cc5d0e75d8d13923cedf5d9260a759077f1ad86d68255f0f9323e8d11616"
2851
+ "size": 89813,
2852
+ "sha256": "54bcd9c6e751555044b9b436db9e8f10e0be280b60c00048c68de5570512bf16"
2848
2853
  },
2849
2854
  {
2850
2855
  "path": "skills/self-review/references/domain-probes/ai_overclaiming.md",
2851
- "size": 12790,
2852
- "sha256": "cda5e4c56c37c37e036565f94d05487b8aa2d1c90b2a6a5b2e3113171456b6d7"
2856
+ "size": 14205,
2857
+ "sha256": "565aa362e20ea6ca923510ea393829fb851e0caaae3d732f99144dd48a25b951"
2853
2858
  },
2854
2859
  {
2855
2860
  "path": "skills/self-review/references/domain-probes/case_report.md",
@@ -2863,8 +2868,8 @@
2863
2868
  },
2864
2869
  {
2865
2870
  "path": "skills/self-review/references/domain-probes/diagnostic_accuracy.md",
2866
- "size": 7673,
2867
- "sha256": "292a3cefe72e55856084dc2d0ba7cd46ec60d8f52fd488ea57b72534f7ee39f7"
2871
+ "size": 9002,
2872
+ "sha256": "8232e2023d3c4d52e6a2d9003ae335302d6fe78c54d39b81f591c07becaf4df0"
2868
2873
  },
2869
2874
  {
2870
2875
  "path": "skills/self-review/references/domain-probes/equity_fairness.md",
@@ -2883,8 +2888,8 @@
2883
2888
  },
2884
2889
  {
2885
2890
  "path": "skills/self-review/references/domain-probes/observational_confounding.md",
2886
- "size": 24897,
2887
- "sha256": "8c21c2b0cde6d46dfdc9757571aeb480e0a692d30aea19d56b06654bbd3ba803"
2891
+ "size": 27378,
2892
+ "sha256": "bbced9817545c65992e7c3430057e32b9f32e092180bc91bd08730f03b2dcf9a"
2888
2893
  },
2889
2894
  {
2890
2895
  "path": "skills/self-review/references/domain-probes/radiomics.md",
@@ -2898,13 +2903,13 @@
2898
2903
  },
2899
2904
  {
2900
2905
  "path": "skills/self-review/references/domain-probes/sr_ma.md",
2901
- "size": 13933,
2902
- "sha256": "bcd74a02e89aec5c0436297b93181cd256f7f3a0d676843c214e330fa8d3b036"
2906
+ "size": 17239,
2907
+ "sha256": "486d569f559f16d62b882f7256ccfe7443e3e2b4cc72a407a5cd046162a98b25"
2903
2908
  },
2904
2909
  {
2905
2910
  "path": "skills/self-review/references/domain-probes/survival_prognostic.md",
2906
- "size": 13073,
2907
- "sha256": "0a5922aa7ca389089de394b3fa569565600384ceabae0bd25bfd50efa901cd13"
2911
+ "size": 13765,
2912
+ "sha256": "af0e0323876424eec8ba07be22fe99b3100356e8e4895b5b79635e9a2d76e98e"
2908
2913
  },
2909
2914
  {
2910
2915
  "path": "skills/self-review/references/exemplar_findings/README.md",
@@ -2953,8 +2958,8 @@
2953
2958
  },
2954
2959
  {
2955
2960
  "path": "skills/self-review/scripts/check_artifact_coverage.py",
2956
- "size": 12440,
2957
- "sha256": "44a935c657bb597e3596c2e361c8d8ee89c613cb020715f27792cf1844731066"
2961
+ "size": 17113,
2962
+ "sha256": "56096c39ddb0083c04a1254f06bafa6fac9fc8a136c9246f68773f0ba5da96d4"
2958
2963
  },
2959
2964
  {
2960
2965
  "path": "skills/self-review/scripts/check_claim_artifact.py",
@@ -2963,19 +2968,24 @@
2963
2968
  },
2964
2969
  {
2965
2970
  "path": "skills/self-review/scripts/check_classical_style.py",
2966
- "size": 10067,
2967
- "sha256": "5cf7f9a331a4ebcb6f8523fcb5fd1b872079e3bb6e00bf9821a03d504174fdaa"
2971
+ "size": 10953,
2972
+ "sha256": "3b5c85edc57ee607a2b0a10898d68ac163ce3be6fe50b82c8c11490d7bc2705a"
2968
2973
  },
2969
2974
  {
2970
2975
  "path": "skills/self-review/scripts/check_cohort_arithmetic.py",
2971
- "size": 23363,
2972
- "sha256": "4ef7bded82091bc6cac8bf35cfa19bc1651acb7e0df2aa8b742bd7a04c8b3991"
2976
+ "size": 23868,
2977
+ "sha256": "55933d25d824bf073c0a8b7abc42d2f160c459a288ab932c50d63cf8c03afd37"
2973
2978
  },
2974
2979
  {
2975
2980
  "path": "skills/self-review/scripts/check_confounding_completeness.py",
2976
2981
  "size": 21506,
2977
2982
  "sha256": "7d3e67074d58a28ffee52ce64b486231f103a3ddcaf6b3b6ee83ba5f89c63bc2"
2978
2983
  },
2984
+ {
2985
+ "path": "skills/self-review/scripts/check_null_calibration.py",
2986
+ "size": 7594,
2987
+ "sha256": "9ddff01722c34efb6ffd757ae762c6ee12f5993bf13b11313c2e20b60b26cab3"
2988
+ },
2979
2989
  {
2980
2990
  "path": "skills/self-review/scripts/check_panel_diversity.py",
2981
2991
  "size": 15324,
@@ -2998,8 +3008,13 @@
2998
3008
  },
2999
3009
  {
3000
3010
  "path": "skills/self-review/scripts/check_scope_coherence.py",
3001
- "size": 9538,
3002
- "sha256": "45915b2736f0356e1b2e827939090e5034a01ba5c310b952dd13ad6f2df45a61"
3011
+ "size": 10818,
3012
+ "sha256": "820dfc264c2a4f62c79c0c7123a3e1a8b59a100b89654617a08ff55deeb25a75"
3013
+ },
3014
+ {
3015
+ "path": "skills/self-review/scripts/check_supplement_hygiene.py",
3016
+ "size": 11088,
3017
+ "sha256": "f89027472cdf0258357c3b0f0b0f3fec09b5ea65cc1373292797b818d1acf444"
3003
3018
  },
3004
3019
  {
3005
3020
  "path": "skills/self-review/skill.yml",
package/metadata/distribution_manifest.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "schema_version": 1,
3
- "version": "4.7.0",
3
+ "version": "4.8.0",
4
4
  "owned_skills": [
5
5
  "academic-aio",
6
6
  "add-journal",
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "medsci-skills",
3
- "version": "4.7.0",
3
+ "version": "4.8.0",
4
4
  "description": "MedSci Skills — a medical/scientific research skill suite for AI coding agents (Claude Code, Codex, Cursor, Copilot). The npm package is a terminal-friendly installer shortcut; the canonical distribution remains the GitHub repository and the Claude Code plugin marketplace.",
5
5
  "license": "SEE LICENSE IN LICENSE",
6
6
  "homepage": "https://github.com/Aperivue/medsci-skills#readme",
package/skills/analyze-stats/scripts/check_generated_code.py CHANGED
@@ -145,9 +145,15 @@ def check_text_common(src: str, lang: str) -> list[dict]:
145
145
  for m in NUM_LITERAL_BODY.finditer(src):
146
146
  body = m.group(1)
147
147
  # ignore obvious non-data: ranges, single repeated, function-call args with kwargs
148
- nums = NUM_TOKEN.findall(body)
149
148
  if "=" in body: # kwargs like figsize=(8,6) or linspace(0,1,...) — not table data
150
149
  continue
150
+ # A list/tuple of string literals (e.g. a hex-color palette
151
+ # ['#000000','#E69F00',...] — exactly the colorblind-safe WONG palette that
152
+ # make-figures recommends) is NOT hand-typed tabular data. Strip quoted
153
+ # substrings before counting numeric tokens, so digits living inside string
154
+ # literals (the "00" in '#E69F00', RGBA codes, category labels) don't make a
155
+ # string list look table-shaped. Genuine numeric data is unquoted.
156
+ nums = NUM_TOKEN.findall(STR_LITERAL.sub("", body))
151
157
  if len(nums) >= DATA_LITERAL_STANDALONE or (len(nums) >= DATA_LITERAL_MIN and has_read):
152
158
  ln = src[:m.start()].count("\n") + 1
153
159
  claims.append({
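The effect of the change in this hunk can be sketched in isolation. The diff does not show how `NUM_TOKEN` or `STR_LITERAL` are defined in the package, so the two patterns below (`NUM_TOKEN_SKETCH`, `STR_LITERAL_SKETCH`) are hypothetical stand-ins, and `count_numeric_tokens` is an illustrative helper rather than a function from the package.

```python
import re

# Hypothetical stand-ins: the package's real NUM_TOKEN / STR_LITERAL patterns
# are not shown in this diff.
NUM_TOKEN_SKETCH = re.compile(r"\d+(?:\.\d+)?")
STR_LITERAL_SKETCH = re.compile(r"'[^'\n]*'|\"[^\"\n]*\"")


def count_numeric_tokens(body: str, strip_strings: bool = True) -> int:
    """Count numeric tokens in a list/tuple body, optionally ignoring quoted text."""
    if strip_strings:
        # Drop quoted substrings first, so digits inside hex colors or
        # category labels are not counted as numeric data.
        body = STR_LITERAL_SKETCH.sub("", body)
    return len(NUM_TOKEN_SKETCH.findall(body))


palette = "'#000000', '#E69F00', '#56B4E9', '#009E73'"
table = "0.81, 0.84, 0.79, 0.88"

# Without stripping, digits inside the hex codes look like numeric tokens.
print(count_numeric_tokens(palette, strip_strings=False))
# With stripping, the palette contributes no numeric tokens.
print(count_numeric_tokens(palette))
# Unquoted numeric data is still counted.
print(count_numeric_tokens(table))
```

Under these assumed patterns, the palette body yields zero numeric tokens once quoted substrings are removed, while the unquoted numeric list still yields four, which is the distinction the added comment in the hunk describes.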
package/skills/analyze-stats/scripts/rating_monotonicity.py ADDED
@@ -0,0 +1,132 @@
1
+ #!/usr/bin/env python3
2
+ """Confidence-weighted rating → AUC monotonicity probe (reusable template).
3
+
4
+ NOT an auto-discovered detector — it needs the score *encoding* as input, so it is
5
+ a template/helper you point at your own score definition, not a manuscript scanner.
6
+
7
+ When an observer / reader study collapses a (binary-call × confidence) rating into a
8
+ single score used as the ROC/AUC predictor, that score MUST be strictly monotonic in
9
+ "evidence for the positive label" across the full ladder:
10
+
11
+ negative-call, highest confidence = strongest evidence AGAINST positive = LOWEST
12
+
13
+ negative-call, lowest confidence
14
+ positive-call, lowest confidence
15
+
16
+ positive-call, highest confidence = strongest evidence FOR positive = HIGHEST
17
+
18
+ A *folded* score — the classic bug `cws = confidence if positive_call else (K+1) − confidence`
19
+ — makes negative/high-confidence collide with positive/low-confidence and is NOT
20
+ monotonic; it understates discrimination and can flip a gradient to equivalence. Prose
21
+ review cannot see this; re-checking the encoding does.
22
+
23
+ Input JSON (`--encoding score_def.json`):
24
+ {
25
+ "confidence_levels": [1, 2, 3, 4, 5],
26
+ "scores": {
27
+ "positive": {"1": 6, "2": 7, "3": 8, "4": 9, "5": 10},
28
+ "negative": {"1": 5, "2": 4, "3": 3, "4": 2, "5": 1}
29
+ }
30
+ }
31
+ `scores[call][confidence]` is the numeric value fed to the ROC/AUC routine for that cell.
32
+
33
+ Exit codes: 0 monotonic, 1 a collision/inversion (folded or mis-encoded), 2 usage.
34
+ Stdlib-only (json / argparse / sys). Run `--demo` for the 10-combination unit test.
35
+ """
36
+
37
+ from __future__ import annotations
38
+
39
+ import argparse
40
+ import json
41
+ import sys
42
+
43
+
44
+ def evidence_order(levels: list) -> list[tuple[str, object]]:
45
+     """Cells ordered from strongest-against-positive to strongest-for-positive."""
46
+     asc = sorted(levels, key=lambda x: float(x))
47
+     # negative call: highest confidence first (strongest against) → lowest
48
+     neg = [("negative", c) for c in reversed(asc)]
49
+     # positive call: lowest confidence first → highest (strongest for)
50
+     pos = [("positive", c) for c in asc]
51
+     return neg + pos
52
+
53
+
54
+ def check_encoding(spec: dict) -> dict:
55
+     levels = spec["confidence_levels"]
56
+     scores = spec["scores"]
57
+     order = evidence_order(levels)
58
+     seq = []
59
+     for call, conf in order:
60
+         # JSON object keys are strings; tolerate int/str confidence keys.
61
+         cell = scores[call]
62
+         val = cell.get(str(conf), cell.get(conf))
63
+         if val is None:
64
+             return {"ok": False, "problems": [f"missing score for {call}/conf {conf}"],
65
+                     "sequence": seq}
66
+         seq.append({"call": call, "confidence": conf, "score": val})
67
+     problems = []
68
+     for a, b in zip(seq, seq[1:]):
69
+         if b["score"] == a["score"]:
70
+             problems.append(
71
+                 f"collision: {a['call']}/{a['confidence']} and {b['call']}/{b['confidence']} "
72
+                 f"share score {a['score']} (a folded/mirrored encoding maps opposite evidence "
73
+                 f"to the same value)")
74
+         elif b["score"] < a["score"]:
75
+             problems.append(
76
+                 f"inversion: {a['call']}/{a['confidence']} (={a['score']}) ranks above "
77
+                 f"{b['call']}/{b['confidence']} (={b['score']}) but carries weaker evidence "
78
+                 f"for the positive label")
79
+     return {"ok": not problems, "problems": problems,
80
+             "sequence": [(s["call"], s["confidence"], s["score"]) for s in seq]}
81
+
82
+
83
+ def _demo() -> int:
84
+     """The 10-combination (K=5) unit test: a correct directional encoding passes,
85
+     the folded encoding fails on collisions."""
86
+     levels = [1, 2, 3, 4, 5]
87
+     correct = {"confidence_levels": levels,
88
+                "scores": {"positive": {str(c): 5 + c for c in levels},
89
+                           "negative": {str(c): 6 - c for c in levels}}}
90
+     folded = {"confidence_levels": levels,
91
+               "scores": {"positive": {str(c): c for c in levels},
92
+                          "negative": {str(c): 6 - c for c in levels}}}
93
+     rc = check_encoding(correct)
94
+     rf = check_encoding(folded)
95
+     ok = rc["ok"] and not rf["ok"]
96
+     print(f"correct directional encoding: monotonic={rc['ok']} (expected True)")
97
+     print(f"folded encoding: monotonic={rf['ok']} (expected False)")
98
+     if rf["problems"]:
99
+         print(f" folded problems[0]: {rf['problems'][0]}")
100
+     print("DEMO PASS" if ok else "DEMO FAIL")
101
+     return 0 if ok else 1
102
+
103
+
104
+ def main() -> int:
105
+     ap = argparse.ArgumentParser(description="Confidence-weighted rating→AUC monotonicity probe.")
106
+     ap.add_argument("--encoding", help="JSON score-definition file (see module docstring)")
107
+     ap.add_argument("--demo", action="store_true", help="run the 10-combination unit test")
108
+     args = ap.parse_args()
109
+
110
+     if args.demo:
111
+         return _demo()
112
+     if not args.encoding:
113
+         sys.stderr.write("ERROR: --encoding FILE or --demo is required\n")
114
+         return 2
115
+     try:
116
+         spec = json.loads(open(args.encoding, encoding="utf-8").read())
117
+     except (OSError, ValueError) as e:
118
+         sys.stderr.write(f"ERROR: cannot read encoding: {e}\n")
119
+         return 2
120
+
121
+     res = check_encoding(spec)
122
+     if res["ok"]:
123
+         print("OK: the (call × confidence) → score encoding is strictly monotonic.")
124
+         return 0
125
+     print("NON-MONOTONIC encoding (folded or mis-ordered) — this mis-estimates the AUC:")
126
+     for p in res["problems"]:
127
+         print(f" - {p}")
128
+     return 1
129
+
130
+
131
+ if __name__ == "__main__":
132
+     sys.exit(main())
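The docstring's claim that a folded score understates discrimination can be illustrated with a toy calculation. This is a minimal sketch, not part of the package: the six ratings are made-up illustrative values, `pairwise_auc` is a hypothetical stdlib-only helper (the pairwise Mann-Whitney form of the AUC, ties counted as one half), and `K = 5` is assumed to match the demo in the script.

```python
K = 5  # number of confidence levels; assumed to match the script's K=5 demo


def directional_score(positive_call: bool, confidence: int) -> int:
    """Monotonic encoding: negative/high-confidence lowest, positive/high-confidence highest."""
    return K + confidence if positive_call else K + 1 - confidence


def folded_score(positive_call: bool, confidence: int) -> int:
    """The folded encoding quoted in the docstring: both calls share the range 1..K."""
    return confidence if positive_call else K + 1 - confidence


def pairwise_auc(pos_scores: list, neg_scores: list) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy ratings (illustrative only): (truth_is_positive, positive_call, confidence).
# Every call here is correct, so a sound encoding should separate the classes fully.
ratings = [
    (True, True, 5), (True, True, 4), (True, True, 2),
    (False, False, 5), (False, False, 3), (False, False, 1),
]


def auc_under(score_fn) -> float:
    pos = [score_fn(call, conf) for truth, call, conf in ratings if truth]
    neg = [score_fn(call, conf) for truth, call, conf in ratings if not truth]
    return pairwise_auc(pos, neg)


auc_directional = auc_under(directional_score)
auc_folded = auc_under(folded_score)
print(f"directional AUC = {auc_directional:.3f}")
print(f"folded AUC      = {auc_folded:.3f}")
```

On these toy ratings the directional encoding gives an AUC of 1.0, while the folded encoding gives 5.5/9 (about 0.61) for the same reader, because low-confidence positive calls fall below low-confidence negative calls once both are mapped into 1..K.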
package/skills/analyze-stats/tests/fixtures/gen_palette.py ADDED
@@ -0,0 +1,22 @@
1
+ """
2
+ Analysis: synthetic CLEAN fixture exercising the colorblind-safe palette.
3
+ A hex-color list must NOT be flagged HARDCODED_DATA_LITERAL even alongside a
4
+ data-file read (regression for the WONG-palette false positive).
5
+ Date: 2026-01-01
6
+ Random seed: 42
7
+ """
8
+ import numpy as np
9
+ import pandas as pd
10
+
11
+ np.random.seed(42)
12
+
13
+ # The Wong (2011) colorblind-safe palette that make-figures recommends. Eight
14
+ # string literals; the digits live inside the hex codes, not in tabular data.
15
+ WONG = ["#000000", "#E69F00", "#56B4E9", "#009E73",
16
+ "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]
17
+
18
+ # portable relative path; no hand-typed numeric data
19
+ df = pd.read_csv("cohort.csv")
20
+
21
+ means = df.groupby("group")["auc"].mean()
22
+ print(WONG[0], float(means.iloc[0]))
package/skills/analyze-stats/tests/test_generated_code.sh CHANGED
@@ -55,5 +55,17 @@ check "exit 0 (clean .py)" test "$?" -eq 0
55
55
  python3 "$SCRIPT" --code-dir "$HERE/fixtures" --strict --quiet >/dev/null 2>&1
56
56
  check "exit 1 (--code-dir scan)" test "$?" -eq 1
57
57
 
58
+ # (5) hex-color palette + data read -> NOT HARDCODED_DATA_LITERAL (WONG-palette
59
+ # false-positive regression); the script is otherwise clean -> exit 0
60
+ PALETTE="$HERE/fixtures/gen_palette.py"
61
+ python3 "$SCRIPT" "$PALETTE" --out "$OUT" --quiet >/dev/null 2>&1
62
+ check "no HARDCODED_DATA_LITERAL on hex-color palette" python3 -c "
63
+ import json
64
+ d=json.load(open('$OUT'))
65
+ assert not any(c['verdict']=='HARDCODED_DATA_LITERAL' for c in d['claims']), 'palette flagged as data literal'
66
+ "
67
+ python3 "$SCRIPT" "$PALETTE" --strict --quiet >/dev/null 2>&1
68
+ check "exit 0 on clean palette script" test "$?" -eq 0
69
+
58
70
  echo "fail=$fail"; [[ "$fail" -eq 0 ]] && echo "ALL PASS" || echo "FAILURES: $fail"
59
71
  exit "$fail"
package/skills/design-study/SKILL.md CHANGED
@@ -195,6 +195,39 @@ For an AI-system-versus-human-expert benchmark specifically, route to `/design-a
195
195
  extends this subsection with arm definition, LLM-as-judge versus human-as-judge adjudication, and a
196
196
  structured export schema.
197
197
 
198
+ **Perceptual / reader AI study — design-stage ceiling gate**
199
+
200
+ For a reader/observer/perceptual or diagnostic-accuracy AI study (visual Turing test, AI-vs-human
201
+ detection, image-provenance/deepfake, observer study), the acceptance ceiling is fixed **at design
202
+ time, not at analysis time** — excellent execution cannot lift a ceiling baked into the comparator,
203
+ the estimand, or the reader cohort. Walk these six before data lock and, for each, take the
204
+ higher-ambition option or record an explicit, defensible reason not to (set each at the impact level
205
+ of the journal you actually want):
206
+
207
+ 1. **Comparator realism (biggest lever).** A curated teaching-repository "authentic" arm scopes the
208
+ claim to "teaching-quality", not clinical. Use consecutive, de-identified clinical-acquisition
209
+ images (the real PACS spectrum), or add a clinical-spectrum validation arm.
210
+ 2. **Format / non-content confound matching.** Match every non-content attribute (aspect ratio,
211
+ resolution, compression, color profile) across arms by construction, and pre-specify a
212
+ confound-classifier ceiling check (format-only AUC must be ≪ reader AUC) as a *primary* gate.
213
+ 3. **Synthetic / index-arm denominator (survivorship).** Pre-specify how failed/low-quality
214
+ generations are counted; report the full generation denominator rather than evaluating only the
215
+ convincing survivors.
216
+ 4. **Reader independence and breadth.** Recruit an independent, non-author, multi-site (ideally
217
+ multi-national) reader cohort; collect reader characteristics; blind readers to the hypothesis
218
+ where feasible.
219
+ 5. **Estimand and power (generalize, don't condition).** Power the reader-AND-case generalization as
220
+ the **primary** estimand from the start, so the two-way interval — not a pool-conditional number —
221
+ supports the headline claim.
222
+ 6. **Novelty positioning vs scoop, and venue-fit.** Scan for close prior work at design time; if a
223
+ flagship precedent exists, make the differentiation categorical (new modality class, clinical
224
+ spectrum, outcome linkage), not incremental; pick the venue whose audience values the likely
225
+ result (a rigorous null fits a methodology-forward journal better than an impact-first one).
226
+
227
+ The meta-rule: set the comparator, the confound-matching, the reader cohort, and the estimand at the
228
+ target journal's impact level **before** data collection — do not plan to out-write a structural
229
+ ceiling in revision.
230
+
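The confound-classifier ceiling check in item 2 of the gate above might be pre-specified along the following lines. This is a sketch under stated assumptions, not something shipped in the package: the format-only feature (a file size in KB), the reader AUC of 0.85, and the 0.15 margin used to operationalize "≪" are all hypothetical, and `pairwise_auc` is an illustrative stdlib-only helper.

```python
def pairwise_auc(pos_scores: list, neg_scores: list) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def format_only_auc(synthetic: list, authentic: list) -> float:
    """Direction-free AUC: a format cue is a confound whichever arm it favors."""
    auc = pairwise_auc(synthetic, authentic)
    return max(auc, 1.0 - auc)


def passes_ceiling_gate(fmt_auc: float, reader_auc: float, margin: float) -> bool:
    """Gate passes only if the format-only AUC sits at least `margin` below the reader AUC."""
    return fmt_auc + margin <= reader_auc


# Hypothetical format-only feature (file size in KB) per image, by arm.
synthetic_kb = [512, 498, 505, 520]
authentic_kb = [500, 515, 495, 510]

READER_AUC = 0.85  # hypothetical pooled reader AUC
MARGIN = 0.15      # hypothetical pre-specified margin standing in for "much less than"

fmt_auc = format_only_auc(synthetic_kb, authentic_kb)
passes = passes_ceiling_gate(fmt_auc, READER_AUC, MARGIN)
print(f"format-only AUC = {fmt_auc:.3f}, gate passes = {passes}")
```

Fixing the margin before data lock is the point of the design choice: if a classifier that sees only format attributes approaches the readers' AUC, the readers' result cannot be attributed to image content.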
198
231
  ### Phase 3: Clinical framing
199
232
 
200
233
  Ask whether the comparator and endpoint support the stated claim:
package/skills/peer-review/SKILL.md CHANGED
@@ -135,11 +135,11 @@ confidential note and the recommendation are consistent.
135
135
 
136
136
  ### Phase 2A: Systematic Review / Meta-Analysis Extension
137
137
 
138
- Apply this internal-consistency-first gate (P0) plus 10-probe checklist (P1–P10) **only when manuscript type is "Systematic Review", "Meta-Analysis", or "Systematic Review and Meta-Analysis"**. These probes complement (do not replace) the generic Phase 2 issue checklist.
138
+ Apply this internal-consistency-first gate (P0) plus 17-probe checklist (P1–P17) **only when manuscript type is "Systematic Review", "Meta-Analysis", or "Systematic Review and Meta-Analysis"**. These probes complement (do not replace) the generic Phase 2 issue checklist.
139
139
 
140
140
  **SR-MA reviews almost always justify Tier 3 word budget** (1000-1400w) — apply ≥3 of P1-P10 triggering = Tier 3 default.
141
141
 
142
- **Probe detail (P0–P10), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/sr_ma.md`. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route conclusion-threatening or integrity findings into the Confidential Comments to the Editor, and place a confirmed error that drives a headline claim as the Major #1 candidate.
142
+ **Probe detail (P0–P17), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/sr_ma.md`. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route conclusion-threatening or integrity findings into the Confidential Comments to the Editor, and place a confirmed error that drives a headline claim as the Major #1 candidate.
143
143
 
144
144
  ### Phase 2B: Survival / Prognostic Model Extension
145
145
 
@@ -182,14 +182,14 @@ The original-research probes (Phase 2 issue checklist, Phase 2A/2B/2C) do not tr
182
182
 
183
183
  ### Phase 2E: Observational / Confounding Extension
184
184
 
185
- Apply this 14-probe checklist (O1–O14) **only when the manuscript is an observational study** (cohort, case-control, cross-sectional, health-screening / registry) **whose central claim is an adjusted exposure–outcome association** estimated by covariate adjustment rather than randomization. These probes complement (do not replace) the generic Phase 2 issue checklist and the STROBE reporting items; they target the gap between the stated adjustment set and what the exposure-stratified Table 1 shows.
185
+ Apply this 16-probe checklist (O1–O16) **only when the manuscript is an observational study** (cohort, case-control, cross-sectional, health-screening / registry) **whose central claim is an adjusted exposure–outcome association** estimated by covariate adjustment rather than randomization. These probes complement (do not replace) the generic Phase 2 issue checklist and the STROBE reporting items; they target the gap between the stated adjustment set and what the exposure-stratified Table 1 shows.
186
186
 
187
187
  **Exempt**:
188
188
  - Randomized trials (confounding controlled by design → Phase 2 + CONSORT)
189
189
  - Purely descriptive / prevalence reports with no adjusted association claim
190
190
  - Diagnostic-accuracy studies with no exposure–outcome estimand (→ Phase 2A DTA cells + categories A–C)
191
191
 
192
- **Probe detail (O1–O14), with output templates:** `${CLAUDE_SKILL_DIR}/references/domain-probes/observational_confounding.md`. Load it and apply each probe when the trigger above fires. O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set), O7 (an outcome consequence/mediator wrongly adjusted — the opposite-direction failure, e.g. serum uric acid in an eGFR model), and O8 (records > subjects with the analysis unit undisclosed) are data-checkable and the highest-yield probes — verify O1/O7 against the manuscript's own Table 1 and run the records-vs-subjects check for O8. In this skill, map each probe finding to the review draft as a Major / Minor comment; a confounding-completeness gap (O1), over-adjustment that moves the headline estimate (O7), a selection/collider structure (O3), undisclosed repeat-subject clustering (O8), an undisclosed complete-case collapse (O5), a report-derived outcome with no construct-validity defence (O9), an inferential effect-size gradient across overlapping/nested subsets with no difference/interaction test (O10), an ignored/mis-specified complex-survey design (O11, NHANES/KNHANES weights without strata+PSU, or a subgroup by row-deletion), a data-mined inflection-point/'saturation' cutoff (O12), a cross-sectional mediation claimed as a causal chain without a temporal-order caveat / M–Y-confounding sensitivity (O13), or a synergy/joint-effect claim on the wrong interaction scale — multiplicative-only or joint-category ORs with no additive RERI/AP/S (O14) — is design-level, so surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate rather than softening it to a reporting fix.
192
+ **Probe detail (O1–O16), with output templates:** `${CLAUDE_SKILL_DIR}/references/domain-probes/observational_confounding.md`. Load it and apply each probe when the trigger above fires. O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set), O7 (an outcome consequence/mediator wrongly adjusted — the opposite-direction failure, e.g. serum uric acid in an eGFR model), and O8 (records > subjects with the analysis unit undisclosed) are data-checkable and the highest-yield probes — verify O1/O7 against the manuscript's own Table 1 and run the records-vs-subjects check for O8. In this skill, map each probe finding to the review draft as a Major / Minor comment; a confounding-completeness gap (O1), over-adjustment that moves the headline estimate (O7), a selection/collider structure (O3), undisclosed repeat-subject clustering (O8), an undisclosed complete-case collapse (O5), a report-derived outcome with no construct-validity defence (O9), an inferential effect-size gradient across overlapping/nested subsets with no difference/interaction test (O10), an ignored/mis-specified complex-survey design (O11, NHANES/KNHANES weights without strata+PSU, or a subgroup by row-deletion), a data-mined inflection-point/'saturation' cutoff (O12), a cross-sectional mediation claimed as a causal chain without a temporal-order caveat / M–Y-confounding sensitivity (O13), or a synergy/joint-effect claim on the wrong interaction scale — multiplicative-only or joint-category ORs with no additive RERI/AP/S (O14) — is design-level, so surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate rather than softening it to a reporting fix.
 
  ### Phase 2E-2: Clinical Prediction-Model Extension
 
@@ -201,7 +201,7 @@ Apply this 4-probe checklist (CP1–CP4) **only when the manuscript develops or
 
  Apply when an AI/ML **primary study** (diagnostic, prognostic, triage, detection) makes a clinical claim in the Title/Abstract/Conclusion — generalizable, outperforms clinicians, deployment-ready, can replace a reader. Complements Phase 2F (recommendation calibration) and the signature "Overclaiming vs evidence level" check; co-applies with Phase 2C for radiomics-AI and Phase 2B for prognostic-AI.
 
- **Probe detail (AO0–AO5), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/ai_overclaiming.md`. Load it and apply each probe when the trigger fires. Run AO0 first — locate the load-bearing claim and read it together with its cited evidence before alleging over-reach (a hedged Discussion qualifier is not a headline). In this skill, map each probe finding to the review draft as a Major / Minor comment; a headline generalizability (AO1), superiority/replacement (AO2/AO3), or deployment-readiness (AO4) claim that outruns the design is framing-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate when it is the paper's headline. AO5 catches over-reach in the reported metric itself (best-fold headline without cross-fold CI/SD, unstated/test-tuned operating point, rebalanced-accuracy, or a code-vs-claims mismatch); pair it with the `exemplar_reviews/optimistic_validation_reporting.md` phrasing model and raise it as Major when it carries the headline.
+ **Probe detail (AO0–AO6), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/ai_overclaiming.md`. Load it and apply each probe when the trigger fires. Run AO0 first — locate the load-bearing claim and read it together with its cited evidence before alleging over-reach (a hedged Discussion qualifier is not a headline). In this skill, map each probe finding to the review draft as a Major / Minor comment; a headline generalizability (AO1), superiority/replacement (AO2/AO3), or deployment-readiness (AO4) claim that outruns the design is framing-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate when it is the paper's headline. AO5 catches over-reach in the reported metric itself (best-fold headline without cross-fold CI/SD, unstated/test-tuned operating point, rebalanced-accuracy, or a code-vs-claims mismatch); pair it with the `exemplar_reviews/optimistic_validation_reporting.md` phrasing model and raise it as Major when it carries the headline.
 
  ### Phase 2H: RCT / Intervention-Trial Extension
 
@@ -211,9 +211,9 @@ Apply this 8-probe checklist (RC0–RC7) **only when the manuscript is a randomi
 
  ### Phase 2I: Diagnostic-Accuracy / Reader-Study Extension
 
- Apply this 6-probe checklist (D1–D6) **only when the manuscript is a diagnostic test accuracy (DTA) primary study** — an index test against a reference standard — including **multi-reader multi-case (MRMC)** reader studies (AI-vs-reader or modality comparison). These probes complement (do not replace) the generic Phase 2 issue checklist and the STARD / QUADAS-2 items; they target verification/spectrum/blinding bias and the MRMC design/variance issues a reader study adds. (For a DTA **meta-analysis**, use Phase 2A / `sr_ma.md`.)
+ Apply this 8-probe checklist (D1–D8) **only when the manuscript is a diagnostic test accuracy (DTA) primary study** — an index test against a reference standard — including **multi-reader multi-case (MRMC)** reader studies (AI-vs-reader or modality comparison). These probes complement (do not replace) the generic Phase 2 issue checklist and the STARD / QUADAS-2 items; they target verification/spectrum/blinding bias and the MRMC design/variance issues a reader study adds. (For a DTA **meta-analysis**, use Phase 2A / `sr_ma.md`.)
 
- **Probe detail (D1–D6), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/diagnostic_accuracy.md`. Load it and apply each probe when the trigger fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; two-gate (case-control) sampling (D2), verification/incorporation bias (D1), or an MRMC analysis that ignores reader variance (D6) is design/analysis-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate. Pairs the `analyze-stats` `table-types/reader_study.md` table and the `make-figures` `exemplar_plots/mrmc_roc.md` figure; a test-set-tuned operating threshold pairs with `exemplar_reviews/optimistic_validation_reporting.md`.
+ **Probe detail (D1–D8), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/diagnostic_accuracy.md`. Load it and apply each probe when the trigger fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; two-gate (case-control) sampling (D2), verification/incorporation bias (D1), or an MRMC analysis that ignores reader variance (D6) is design/analysis-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate. Pairs the `analyze-stats` `table-types/reader_study.md` table and the `make-figures` `exemplar_plots/mrmc_roc.md` figure; a test-set-tuned operating threshold pairs with `exemplar_reviews/optimistic_validation_reporting.md`.
 
  ### Phase 2J: Case-Report Extension
 
@@ -336,7 +336,7 @@ After drafting, verify mechanically:
  - ≥50% of Minor requests use hedged forms ("I'd suggest," "could," "would help") rather than imperative ("must," bare "Please [verb]")
  - General Comments names ≥2 specific strengths before listing concerns
  - At most 1 typo/grammar Minor Comment, only if in formal section or systematic
- 9. **SR-MA-specific QC** (if Phase 2A applied): Confirm the P0 internal-consistency gate was run before any fabrication claim. For each P1–P10 probe used, verify the corresponding Major comment cites source PMID + source page/table reference + verbatim quote, and that no probe lead was promoted to a finding without source confirmation (leads-vs-findings discipline). Reviews citing extraction errors without source-page reference are not actionable for authors.
+ 9. **SR-MA-specific QC** (if Phase 2A applied): Confirm the P0 internal-consistency gate was run before any fabrication claim. For each P1–P17 probe used, verify the corresponding Major comment cites source PMID + source page/table reference + verbatim quote, and that no probe lead was promoted to a finding without source confirmation (leads-vs-findings discipline). Reviews citing extraction errors without source-page reference are not actionable for authors.
  10. **Radiomics-reproducibility QC** (if Phase 2C applied): If an acquisition-parameter sweep predicts an outcome from its own grid axes (R1 design-grid circularity) or the substantive result is a cross-domain failure framed as success (R3), confirm the recommendation reflects design-level severity and is not softened to a reporting fix. Where a model × threshold/cohort grid yields a few p < 0.05, confirm the multiplicity / expected-false-positive count is named (R4), not deferred to "statistical review needed."
  11. **Review-article QC** (if Phase 2D applied): Confirm RV1–RV9 are reflected — in particular that novelty/value-add (RV1) is raised for a saturated topic and that gap-filling (RV8) is present, not just error-spotting. Verify SANRA is used as an appraisal aid, not over-enforced as a reporting guideline (no PRISMA demand on a narrative review; only RV3 is SANRA-aligned and phrased as a suggestion). Verify every suggested addition uses "consider adding" phrasing (no "must cite"), is source-confirmed, and that preprints are labeled as preprints (not equated with peer-reviewed guidelines). Confirm Phase 2F was run for the recommendation: when RV1 novelty is a Major in a saturated space with no distinct contribution, the recommendation is escalated toward Reject (the contribution IS the product — weak novelty is unfixable-in-current-form), not defaulted to the revision/Reconsider tier.
  12. **AI/method/review priority QC**: Before a Major Revision (or Reconsider) recommendation, confirm Phase 2F
@@ -406,7 +406,7 @@ For radiomic feature-reproducibility / phantom parameter-sweep / reliability-fil
 
  For Review / narrative / primer / state-of-the-art manuscripts, apply the Phase 2D 9-probe audit (novelty/value-add, scope/aims, evidence-gathering transparency, technical/medical accuracy, taxonomy/synthesis coherence, balance/currency/citation accuracy, load-bearing figures/tables, constructive gap-filling, curated-base circularity) in place of the original-research probes — error-spotting plus proportionate gap-filling, with SANRA used as an appraisal aid only.
 
- For observational studies whose central claim is an adjusted exposure–outcome association, also apply the Phase 2E 12-probe audit (confounding completeness, adjustment-set provenance, selection/collider bias, exposure measurement validity, missing-data / complete-case collapse, residual-confounding E-value, over-adjustment, analysis-unit/clustering, outcome construct validity, overlapping-subset gradient, complex-survey design & weighting, data-driven threshold mining, cross-sectional mediation, interaction scale), with O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set) and O7 (an outcome consequence/mediator wrongly adjusted) checked against the manuscript's own Table 1.
+ For observational studies whose central claim is an adjusted exposure–outcome association, also apply the Phase 2E 16-probe audit (confounding completeness, adjustment-set provenance, selection/collider bias, exposure measurement validity, missing-data / complete-case collapse, residual-confounding E-value, over-adjustment, analysis-unit/clustering, outcome construct validity, overlapping-subset gradient, complex-survey design & weighting, data-driven threshold mining, cross-sectional mediation, interaction scale, selection on modality/procedure availability, serial-imaging lesion-tracking), with O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set) and O7 (an outcome consequence/mediator wrongly adjusted) checked against the manuscript's own Table 1.
 
  For cross-modality image-synthesis manuscripts (MRI→PET / MRI→CT / non-contrast→contrast / low-dose→full-dose) that claim functional/molecular information or a substitute for the unavailable target modality, also apply the Phase 2K 4-probe audit (IS1 determinism/information-ceiling vs a source→label baseline, IS2 target-derived-preprocessing/slice-selection leakage, IS3 global vs lesion-level quantitative agreement, IS4 mechanistic/proxy-signal plausibility); IS2 and IS4 are typically unfixable-in-current-form and govern the recommendation per Phase 2F.
 
@@ -6,7 +6,7 @@
  - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
  Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
 
- # AI / ML overclaiming probes (AO0–AO5)
+ # AI / ML overclaiming probes (AO0–AO6)
 
  A 5-probe checklist (AO1–AO5, with AO0 as a gate) for medical-AI/ML primary studies (diagnostic, prognostic, triage, detection) where the **conclusion's reach exceeds the evidence**. These probes complement (do not replace) the generic Phase 2 issue checklist and the signature "Overclaiming vs evidence level" check. The aim is to keep a framing-level over-reach from passing as a wording nitpick: a paper can report sound metrics yet draw a clinical claim — generalizable, outperforms clinicians, deployment-ready — that the design does not support, and that claim is what a reader carries away. AO1–AO4 target over-reach in the *claim sentences*; AO5 targets over-reach baked into the *reported metric itself* (an optimistically- or unreproducibly-reported number that makes the result look stronger than a faithful estimate). Run AO0 first.
 
@@ -43,6 +43,12 @@ A 5-probe checklist (AO1–AO5, with AO0 as a gate) for medical-AI/ML primary st
  - (d) **Code-vs-claims fidelity**: where code is released, does the described tuning/metric match it? Common mismatches: a claimed hyperparameter search the code does not run; a metric (e.g., specificity) attributed to a library function that does not compute it. A confirmed mismatch is an integrity/reproducibility flag — verify against the released code before asserting it.
  - Severity: MAJOR when the load-bearing performance claim rests on a best-fold number, an unstated/test-tuned threshold, a rebalanced-accuracy headline, or a code-vs-claims mismatch (the reported result is optimistic or not reproducible); MINOR when cross-validation was sound and only the cross-fold summary, the operating point, or a class-aware metric is missing from the write-up.
 
+ **AO6 — Arm-defining task vs deployment workflow (construct validity of the evaluation)**:
+ - Distinct from AO3 (model-task ≠ human-task *framing*) and from scope-coherence (claim ≠ result): AO6 asks whether the **task that defines the study arms mirrors the deployment workflow the claim targets**, or an artificial handicap/selection. Two recurrent failure modes:
+ - (a) **Handicapped arm** — the AI (or comparator) arm is operationalized in a way the real workflow never imposes: e.g., AI read in pure blind interpretation while the actual deployment provides clinical context / priors / the report, so the evaluation measures a task no one performs.
+ - (b) **Success-conditioned selection** — the arm or the analyzed subset is gated on an AI-success condition (cases where the model produced an output, segmentations that "passed", studies the pipeline did not fail on), so the comparison is conditioned on the very thing under test.
+ - This is a **design/paradigm-level** defect: the operationalized task, not the prose, is mis-specified, so it cannot be fixed by rewording the claim — escalate **past an ordinary Major** (editors read it as a Reject-grade construct-validity failure; a panel that files it as a fixable Major under-rates it). The fix is a re-designed arm whose task matches the intended deployment workflow and an unconditioned (consecutive / intention-to-diagnose) analysis set.
+
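Failure mode (b) above can be illustrated with a small sketch that contrasts a success-conditioned sensitivity with an unconditioned (intention-to-diagnose) one. Everything here is hypothetical: the `sensitivity` helper, the `truth` / `ai_call` field names, and the choice to count a case with no AI output as a miss are assumptions for the example, not rules taken from the probe file.

```python
def sensitivity(cases, intention_to_diagnose):
    """Sensitivity among diseased cases.

    With intention_to_diagnose=False, cases where the AI produced no output
    (ai_call is None) are dropped, which conditions on AI success.
    With intention_to_diagnose=True, those cases stay in the denominator and
    count as misses (a worst-case assumption made for this sketch).
    """
    diseased = [c for c in cases if c["truth"] == 1]
    if not intention_to_diagnose:
        diseased = [c for c in diseased if c["ai_call"] is not None]
    if not diseased:
        return None
    hits = sum(1 for c in diseased if c["ai_call"] == 1)
    return hits / len(diseased)


# Hypothetical data: 10 diseased cases, 7 detected, 1 missed, 2 pipeline failures.
cases = (
    [{"truth": 1, "ai_call": 1}] * 7
    + [{"truth": 1, "ai_call": 0}] * 1
    + [{"truth": 1, "ai_call": None}] * 2
)

conditioned = sensitivity(cases, intention_to_diagnose=False)   # 7 of 8
unconditioned = sensitivity(cases, intention_to_diagnose=True)  # 7 of 10
print(conditioned, unconditioned)
```

The gap between the two numbers is what a success-gated analysis set hides; how failures should be scored in a real study is a design decision the authors need to state.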
  ## Decision-impact / early-deployment probes (DECIDE-AI axis, DI1–DI5)
 
  Co-apply when a study claims **clinical utility, deployment, or decision impact** of an AI system, or *is* an early-stage live clinical evaluation. The reporting axis is then DECIDE-AI (early-stage clinical evaluation of AI decision-support); these probes check that a utility/deployment claim rests on real-use evidence, not retrospective accuracy. They sharpen AO4 for the deployment-evaluation case.
@@ -6,7 +6,7 @@
  - self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
  Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
 
- # Diagnostic-accuracy / reader-study probes (D1–D6)
+ # Diagnostic-accuracy / reader-study probes (D1–D8)
 
  A checklist for **diagnostic test accuracy (DTA) primary studies** — an index test against a reference standard, including **multi-reader multi-case (MRMC)** reader studies (e.g., AI-vs-reader or modality-comparison). These probes complement (do not replace) the generic Phase 2 issue checklist and the STARD / QUADAS-2 items; they target the biases QUADAS-2 names and the MRMC design/variance issues a reader study adds. Pairs the `analyze-stats` `table-standards/table-types/reader_study.md` table and the `make-figures` `exemplar_plots/mrmc_roc.md` figure. (For a DTA **meta-analysis**, use `sr_ma.md`.)
 
@@ -38,9 +38,14 @@ A checklist for **diagnostic test accuracy (DTA) primary studies** — an index
  - Typical signatures: "patients were included if [index score] ≥ T" in a paper whose aim is to evaluate that index; enrolling on a positive screening test to then "validate" the screening test; defining the diseased group by the same reader/algorithm output under study.
  - This is **design-level, not a reporting fix** — escalate past an ordinary Major (a co-reviewer / editor reads it as a fatal selection/spectrum artifact). The fix is a reference standard and an enrollment criterion that are independent of the index test (a consecutive suspected-disease series), not a re-analysis.
 
+ **D8 — Exclusion flow-diagram ↔ Methods-prose consistency + modality-safety enumeration**:
+ - Cross-check the **exclusion criteria drawn in the participant flow diagram** (STROBE/STARD flow) against the **exclusion list in the Methods prose**. A criterion that appears in one but not the other (a count in the flow with no prose rationale, or a prose exclusion not reflected in the flow boxes) is a reporting inconsistency a co-reviewer catches by reading the two side by side.
+ - For an **imaging** study, check that **modality-specific safety contraindications** and **device/artifact exclusions** are enumerated where applicable: MR safety (pacemaker/implant, claustrophobia), iodinated/gadolinium **contrast** contraindications (renal function, allergy, pregnancy), and **image-quality/artifact** exclusions (motion, metal artifact, incomplete coverage). Silent omission of these categories in a prospective imaging cohort understates the selected spectrum.
+ - Severity: a flow-vs-prose exclusion mismatch is MAJOR when it changes the analytic-N or the eligible spectrum; missing modality-safety/artifact exclusion categories is MINOR–MAJOR depending on how much of the source population they remove. The fix is a reconciled exclusion list (flow == prose) plus an explicit modality-safety/artifact exclusion enumeration.
+
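The D8 flow-versus-prose cross-check can be sketched as a set comparison plus an arithmetic check on the analytic N. This is a minimal illustration under stated assumptions: the helper names `reconcile_exclusions` and `analytic_n_consistent`, the label normalisation, and the example counts are all hypothetical, and real manuscripts will need manual matching of differently worded criteria.

```python
def reconcile_exclusions(flow_counts, prose_criteria):
    """Compare exclusion labels in the flow diagram with those in Methods prose.

    flow_counts: dict mapping a flow-diagram exclusion label to its count.
    prose_criteria: iterable of exclusion labels listed in the Methods text.
    Labels are matched after trimming and lower-casing only.
    """
    flow = {label.strip().lower() for label in flow_counts}
    prose = {label.strip().lower() for label in prose_criteria}
    return {
        "flow_only": sorted(flow - prose),    # counted in the flow, no prose rationale
        "prose_only": sorted(prose - flow),   # stated in prose, missing from flow boxes
        "matched": sorted(flow & prose),
    }


def analytic_n_consistent(n_enrolled, flow_counts, n_analysed):
    """Check that enrolled minus all flow exclusions equals the analysed N."""
    return n_enrolled - sum(flow_counts.values()) == n_analysed


# Hypothetical imaging cohort.
flow = {"Motion artifact": 12, "Pacemaker": 3, "Withdrew consent": 5}
prose = ["motion artifact", "pacemaker", "contrast allergy"]

report = reconcile_exclusions(flow, prose)
n_ok = analytic_n_consistent(200, flow, 180)
print(report, n_ok)
```

Any entry under `flow_only` or `prose_only` is a lead to read the two sources side by side, not a finding on its own.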
  **Output template (D2 / D6 example)**:
  > "The study uses a case-control (two-gate) design — confirmed cases versus healthy controls — rather than a consecutive series of patients in whom the diagnosis was suspected. This typically overestimates accuracy and does not reflect the intended-use spectrum, so I'd read the reported sensitivity/specificity as proof-of-concept rather than clinical accuracy, and suggest tempering the Abstract accordingly. Separately, the reader study reports a single reader-averaged AUC; because readers are a sample, I'd suggest an MRMC analysis (e.g., Obuchowski–Rockette) that accounts for both reader and case variance, with per-reader estimates shown and the unit of analysis (per-patient vs per-lesion) stated."
 
- **Discipline — leads vs findings (applies to D1–D6)**:
+ **Discipline — leads vs findings (applies to D1–D8)**:
  - A verification/blinding/spectrum concern from a quick scan is a **lead until Methods and the participant flow are read together** — distinguish under-reporting (ask to clarify) from a true design bias (MAJOR).
  - Anchor each comment to the exact bias (partial vs differential verification; single- vs two-gate; reader-averaged vs fixed-reader; per-patient vs per-lesion) and the location. Keep severity tied to what the flaw does: two-gate sampling, incorporation bias, or ignoring reader variance is design/analysis-level (MAJOR, often Major #1); an unreported reading-order detail is a clarify-request.