medsci-skills 4.7.0 → 4.8.0
- package/README.md +7 -0
- package/metadata/distribution_files.json +51 -36
- package/metadata/distribution_manifest.json +1 -1
- package/package.json +1 -1
- package/skills/analyze-stats/scripts/check_generated_code.py +7 -1
- package/skills/analyze-stats/scripts/rating_monotonicity.py +132 -0
- package/skills/analyze-stats/tests/fixtures/gen_palette.py +22 -0
- package/skills/analyze-stats/tests/test_generated_code.sh +12 -0
- package/skills/design-study/SKILL.md +33 -0
- package/skills/peer-review/SKILL.md +9 -9
- package/skills/peer-review/references/domain-probes/ai_overclaiming.md +7 -1
- package/skills/peer-review/references/domain-probes/diagnostic_accuracy.md +7 -2
- package/skills/peer-review/references/domain-probes/observational_confounding.md +11 -1
- package/skills/peer-review/references/domain-probes/sr_ma.md +21 -1
- package/skills/peer-review/references/domain-probes/survival_prognostic.md +1 -0
- package/skills/self-review/SKILL.md +77 -5
- package/skills/self-review/references/domain-probes/ai_overclaiming.md +7 -1
- package/skills/self-review/references/domain-probes/diagnostic_accuracy.md +7 -2
- package/skills/self-review/references/domain-probes/observational_confounding.md +11 -1
- package/skills/self-review/references/domain-probes/sr_ma.md +21 -1
- package/skills/self-review/references/domain-probes/survival_prognostic.md +1 -0
- package/skills/self-review/scripts/check_artifact_coverage.py +91 -2
- package/skills/self-review/scripts/check_classical_style.py +20 -6
- package/skills/self-review/scripts/check_cohort_arithmetic.py +8 -1
- package/skills/self-review/scripts/check_null_calibration.py +175 -0
- package/skills/self-review/scripts/check_scope_coherence.py +22 -2
- package/skills/self-review/scripts/check_supplement_hygiene.py +255 -0
- package/skills/self-review/tests/fixtures/cohort_rate_tier_fp.md +4 -0
- package/skills/self-review/tests/fixtures/coverage_promised_stat.md +9 -0
- package/skills/self-review/tests/fixtures/coverage_promised_stat_ok.md +8 -0
- package/skills/self-review/tests/fixtures/null_bad.md +11 -0
- package/skills/self-review/tests/fixtures/null_clean.md +13 -0
- package/skills/self-review/tests/fixtures/scope_disclaimer.md +9 -0
- package/skills/self-review/tests/fixtures/style_dagger_footnote.md +11 -0
- package/skills/self-review/tests/fixtures/supp_clean.md +10 -0
- package/skills/self-review/tests/fixtures/supp_dirty.md +14 -0
- package/skills/self-review/tests/fixtures/supp_xref_body.md +4 -0
- package/skills/self-review/tests/fixtures/supp_xref_supp.md +5 -0
- package/skills/self-review/tests/test_artifact_coverage.sh +18 -0
- package/skills/self-review/tests/test_classical_style.sh +12 -0
- package/skills/self-review/tests/test_cohort_arithmetic.sh +13 -0
- package/skills/self-review/tests/test_null_calibration.sh +47 -0
- package/skills/self-review/tests/test_scope_coherence.sh +13 -0
- package/skills/self-review/tests/test_supplement_hygiene.sh +63 -0
package/README.md
CHANGED
@@ -11,6 +11,7 @@
 [](https://github.com/Aperivue/medsci-skills/actions/workflows/validate.yml)
 
 [](https://www.npmjs.com/package/medsci-skills)
+[](https://youtu.be/MclQ_RIofpE)
 [](https://github.com/Aperivue/medsci-skills/contribute)
 
 [](https://agentskills.io)
@@ -267,6 +268,12 @@ The E2E pipeline (`orchestrate --e2e`) produces everything up to `qc/`. The `sub
 
 ## What's New
 
+**v4.8** is the **review-harvest batch** — deterministic detector hardening promoted from real-manuscript review cycles. Additive and backward-compatible; still 45 skills / 36 guidelines, analysis-integrity detectors **30 → 32**:
+
+- **Two new gates** — `check_supplement_hygiene.py` lints the rendered supplement / tables / caption files (not just the manuscript) for §-labels, placeholders, build markers, response-letter framing, and unresolved body↔supplement cross-references; `check_null_calibration.py` flags a headline negative/equivalence claim made without a minimum-detectable-effect / power / equivalence statement.
+- **Four detector false-positive fixes** — gates no longer fire on a recommended colorblind-safe palette, author-footnote `§` daggers, a correctly-hedged disclaimer, or a tier-label digit; each with a regression fixture and three newly CI-wired test suites.
+- **Nine reviewer-side domain probes** (SR/MA, observational, diagnostic, AI-overclaiming, survival) plus a `/design-study` design-stage ceiling gate for perceptual/reader-AI studies and a reusable confidence-weighted-rating→AUC monotonicity template.
+
 **v4.7** is the **self-update foundation** — physician-researchers stay current without GitHub, git, or a terminal. Additive and backward-compatible; still 45 skills / 36 guidelines / 30 detectors:
 
 - **Transactional, crash-recoverable installer.** Each install runs through a durable journal state machine recovered on the next run (roll back / forward-clean / fail-closed), with per-target SHA-256 inventories — your modified or third-party skills are backed up and never clobbered or auto-deleted.
package/metadata/distribution_files.json
CHANGED
@@ -358,8 +358,13 @@
     },
     {
       "path": "skills/analyze-stats/scripts/check_generated_code.py",
-      "size":
-      "sha256": "
+      "size": 15024,
+      "sha256": "b8643d9d9a4b600af428604811135fc27cee6eb2140bdae5e393ed9b3be37fae"
+    },
+    {
+      "path": "skills/analyze-stats/scripts/rating_monotonicity.py",
+      "size": 5494,
+      "sha256": "b7f8f36c7af060e2dec60f51185ef4cc5a0f2ef20c1d509fb845b36521f420b3"
     },
     {
       "path": "skills/analyze-stats/skill.yml",
@@ -868,8 +873,8 @@
     },
     {
       "path": "skills/design-study/SKILL.md",
-      "size":
-      "sha256": "
+      "size": 18282,
+      "sha256": "dfd73871102af6beab3d21f60f650db41ec2e4025d6ee19eae8de61b17851fd4"
     },
     {
       "path": "skills/design-study/skill.yml",
@@ -2363,8 +2368,8 @@
     },
     {
       "path": "skills/peer-review/SKILL.md",
-      "size":
-      "sha256": "
+      "size": 50304,
+      "sha256": "2cabbeb0695e52e6a4aa1f341a0ddeb3cfe088c53e25c9803a2b1b5f3a0b5bd1"
     },
     {
       "path": "skills/peer-review/references/aczel_2021_reviewer2_patterns.md",
@@ -2373,8 +2378,8 @@
     },
     {
       "path": "skills/peer-review/references/domain-probes/ai_overclaiming.md",
-      "size":
-      "sha256": "
+      "size": 14205,
+      "sha256": "565aa362e20ea6ca923510ea393829fb851e0caaae3d732f99144dd48a25b951"
     },
     {
       "path": "skills/peer-review/references/domain-probes/case_report.md",
@@ -2388,8 +2393,8 @@
     },
     {
       "path": "skills/peer-review/references/domain-probes/diagnostic_accuracy.md",
-      "size":
-      "sha256": "
+      "size": 9002,
+      "sha256": "8232e2023d3c4d52e6a2d9003ae335302d6fe78c54d39b81f591c07becaf4df0"
     },
     {
       "path": "skills/peer-review/references/domain-probes/equity_fairness.md",
@@ -2408,8 +2413,8 @@
     },
     {
       "path": "skills/peer-review/references/domain-probes/observational_confounding.md",
-      "size":
-      "sha256": "
+      "size": 27378,
+      "sha256": "bbced9817545c65992e7c3430057e32b9f32e092180bc91bd08730f03b2dcf9a"
     },
     {
       "path": "skills/peer-review/references/domain-probes/radiomics.md",
@@ -2423,13 +2428,13 @@
     },
     {
       "path": "skills/peer-review/references/domain-probes/sr_ma.md",
-      "size":
-      "sha256": "
+      "size": 17239,
+      "sha256": "486d569f559f16d62b882f7256ccfe7443e3e2b4cc72a407a5cd046162a98b25"
     },
     {
       "path": "skills/peer-review/references/domain-probes/survival_prognostic.md",
-      "size":
-      "sha256": "
+      "size": 13765,
+      "sha256": "af0e0323876424eec8ba07be22fe99b3100356e8e4895b5b79635e9a2d76e98e"
     },
     {
       "path": "skills/peer-review/references/exemplar_reviews/README.md",
@@ -2843,13 +2848,13 @@
     },
     {
       "path": "skills/self-review/SKILL.md",
-      "size":
-      "sha256": "
+      "size": 89813,
+      "sha256": "54bcd9c6e751555044b9b436db9e8f10e0be280b60c00048c68de5570512bf16"
     },
     {
       "path": "skills/self-review/references/domain-probes/ai_overclaiming.md",
-      "size":
-      "sha256": "
+      "size": 14205,
+      "sha256": "565aa362e20ea6ca923510ea393829fb851e0caaae3d732f99144dd48a25b951"
     },
     {
       "path": "skills/self-review/references/domain-probes/case_report.md",
@@ -2863,8 +2868,8 @@
     },
     {
       "path": "skills/self-review/references/domain-probes/diagnostic_accuracy.md",
-      "size":
-      "sha256": "
+      "size": 9002,
+      "sha256": "8232e2023d3c4d52e6a2d9003ae335302d6fe78c54d39b81f591c07becaf4df0"
     },
     {
       "path": "skills/self-review/references/domain-probes/equity_fairness.md",
@@ -2883,8 +2888,8 @@
     },
     {
       "path": "skills/self-review/references/domain-probes/observational_confounding.md",
-      "size":
-      "sha256": "
+      "size": 27378,
+      "sha256": "bbced9817545c65992e7c3430057e32b9f32e092180bc91bd08730f03b2dcf9a"
     },
     {
       "path": "skills/self-review/references/domain-probes/radiomics.md",
@@ -2898,13 +2903,13 @@
     },
     {
       "path": "skills/self-review/references/domain-probes/sr_ma.md",
-      "size":
-      "sha256": "
+      "size": 17239,
+      "sha256": "486d569f559f16d62b882f7256ccfe7443e3e2b4cc72a407a5cd046162a98b25"
     },
     {
       "path": "skills/self-review/references/domain-probes/survival_prognostic.md",
-      "size":
-      "sha256": "
+      "size": 13765,
+      "sha256": "af0e0323876424eec8ba07be22fe99b3100356e8e4895b5b79635e9a2d76e98e"
     },
     {
       "path": "skills/self-review/references/exemplar_findings/README.md",
@@ -2953,8 +2958,8 @@
     },
     {
       "path": "skills/self-review/scripts/check_artifact_coverage.py",
-      "size":
-      "sha256": "
+      "size": 17113,
+      "sha256": "56096c39ddb0083c04a1254f06bafa6fac9fc8a136c9246f68773f0ba5da96d4"
     },
     {
       "path": "skills/self-review/scripts/check_claim_artifact.py",
@@ -2963,19 +2968,24 @@
     },
     {
       "path": "skills/self-review/scripts/check_classical_style.py",
-      "size":
-      "sha256": "
+      "size": 10953,
+      "sha256": "3b5c85edc57ee607a2b0a10898d68ac163ce3be6fe50b82c8c11490d7bc2705a"
     },
     {
       "path": "skills/self-review/scripts/check_cohort_arithmetic.py",
-      "size":
-      "sha256": "
+      "size": 23868,
+      "sha256": "55933d25d824bf073c0a8b7abc42d2f160c459a288ab932c50d63cf8c03afd37"
     },
     {
       "path": "skills/self-review/scripts/check_confounding_completeness.py",
       "size": 21506,
       "sha256": "7d3e67074d58a28ffee52ce64b486231f103a3ddcaf6b3b6ee83ba5f89c63bc2"
     },
+    {
+      "path": "skills/self-review/scripts/check_null_calibration.py",
+      "size": 7594,
+      "sha256": "9ddff01722c34efb6ffd757ae762c6ee12f5993bf13b11313c2e20b60b26cab3"
+    },
     {
       "path": "skills/self-review/scripts/check_panel_diversity.py",
       "size": 15324,
@@ -2998,8 +3008,13 @@
     },
     {
       "path": "skills/self-review/scripts/check_scope_coherence.py",
-      "size":
-      "sha256": "
+      "size": 10818,
+      "sha256": "820dfc264c2a4f62c79c0c7123a3e1a8b59a100b89654617a08ff55deeb25a75"
+    },
+    {
+      "path": "skills/self-review/scripts/check_supplement_hygiene.py",
+      "size": 11088,
+      "sha256": "f89027472cdf0258357c3b0f0b0f3fec09b5ea65cc1373292797b818d1acf444"
     },
     {
       "path": "skills/self-review/skill.yml",
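Each manifest entry above pairs a `path` with a `size` and a `sha256`. A minimal sketch of how one such entry could be checked against a file on disk follows. The helper name `verify_entry` and the idea that `path` is relative to an install root are assumptions for illustration; the diff does not show how the package itself consumes this manifest.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def verify_entry(root: Path, entry: dict) -> list[str]:
    """Return the mismatches between one manifest entry and the file on disk.

    Hypothetical helper: assumes entry["path"] is relative to `root`.
    """
    target = root / entry["path"]
    if not target.is_file():
        return [f"missing: {entry['path']}"]
    data = target.read_bytes()
    problems = []
    if len(data) != entry["size"]:
        problems.append(f"size {len(data)} != {entry['size']}")
    digest = hashlib.sha256(data).hexdigest()
    if digest != entry["sha256"]:
        problems.append(f"sha256 {digest} != {entry['sha256']}")
    return problems


# Self-contained demonstration on a throwaway file (not a real package file).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    payload = b"print('hello')\n"
    (root / "demo.py").write_bytes(payload)
    entry = {
        "path": "demo.py",
        "size": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    print(verify_entry(root, entry))                     # entry matches the file
    print(verify_entry(root, {**entry, "size": 1}))      # size drift is reported
```

Checking the size before the hash is only a convenience for readable error messages; the hash alone already detects any content change.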
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "medsci-skills",
-  "version": "4.
+  "version": "4.8.0",
   "description": "MedSci Skills — a medical/scientific research skill suite for AI coding agents (Claude Code, Codex, Cursor, Copilot). The npm package is a terminal-friendly installer shortcut; the canonical distribution remains the GitHub repository and the Claude Code plugin marketplace.",
   "license": "SEE LICENSE IN LICENSE",
   "homepage": "https://github.com/Aperivue/medsci-skills#readme",
package/skills/analyze-stats/scripts/check_generated_code.py
CHANGED
@@ -145,9 +145,15 @@ def check_text_common(src: str, lang: str) -> list[dict]:
     for m in NUM_LITERAL_BODY.finditer(src):
         body = m.group(1)
         # ignore obvious non-data: ranges, single repeated, function-call args with kwargs
-        nums = NUM_TOKEN.findall(body)
         if "=" in body: # kwargs like figsize=(8,6) or linspace(0,1,...) — not table data
             continue
+        # A list/tuple of string literals (e.g. a hex-color palette
+        # ['#000000','#E69F00',...] — exactly the colorblind-safe WONG palette that
+        # make-figures recommends) is NOT hand-typed tabular data. Strip quoted
+        # substrings before counting numeric tokens, so digits living inside string
+        # literals (the "00" in '#E69F00', RGBA codes, category labels) don't make a
+        # string list look table-shaped. Genuine numeric data is unquoted.
+        nums = NUM_TOKEN.findall(STR_LITERAL.sub("", body))
         if len(nums) >= DATA_LITERAL_STANDALONE or (len(nums) >= DATA_LITERAL_MIN and has_read):
             ln = src[:m.start()].count("\n") + 1
             claims.append({
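The effect of stripping string literals before counting numeric tokens can be sketched as follows. This is an illustration under stated assumptions: the diff does not show how the real `NUM_TOKEN` and `STR_LITERAL` patterns are defined, so the two regexes below are hypothetical stand-ins, and `count_numeric_tokens` is a name invented for this example.

```python
import re

# Hypothetical stand-ins: the real NUM_TOKEN / STR_LITERAL definitions
# are not visible in this hunk.
NUM_TOKEN = re.compile(r"\d+(?:\.\d+)?")
STR_LITERAL = re.compile(r"'[^'\n]*'|\"[^\"\n]*\"")


def count_numeric_tokens(body: str, strip_strings: bool) -> int:
    """Count numeric tokens in a list/tuple body, optionally ignoring quoted text."""
    if strip_strings:
        body = STR_LITERAL.sub("", body)
    return len(NUM_TOKEN.findall(body))


palette = '"#000000", "#E69F00", "#56B4E9", "#009E73"'  # hex colors, all quoted
table = "0.81, 0.79, 0.84, 0.77"                        # unquoted numeric data

print(count_numeric_tokens(palette, strip_strings=False))  # digits inside hex codes are counted
print(count_numeric_tokens(palette, strip_strings=True))   # quoted text removed first
print(count_numeric_tokens(table, strip_strings=True))     # genuine numbers still counted
```

With these stand-in patterns, the quoted palette contributes no numeric tokens once strings are stripped, while the unquoted list keeps all four, which is the distinction the added comment in the hunk describes.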
package/skills/analyze-stats/scripts/rating_monotonicity.py
ADDED
@@ -0,0 +1,132 @@
+#!/usr/bin/env python3
+"""Confidence-weighted rating → AUC monotonicity probe (reusable template).
+
+NOT an auto-discovered detector — it needs the score *encoding* as input, so it is
+a template/helper you point at your own score definition, not a manuscript scanner.
+
+When an observer / reader study collapses a (binary-call × confidence) rating into a
+single score used as the ROC/AUC predictor, that score MUST be strictly monotonic in
+"evidence for the positive label" across the full ladder:
+
+    negative-call, highest confidence = strongest evidence AGAINST positive = LOWEST
+    …
+    negative-call, lowest confidence
+    positive-call, lowest confidence
+    …
+    positive-call, highest confidence = strongest evidence FOR positive = HIGHEST
+
+A *folded* score — the classic bug `cws = confidence if positive_call else (K+1) − confidence`
+— makes negative/high-confidence collide with positive/low-confidence and is NOT
+monotonic; it understates discrimination and can flip a gradient to equivalence. Prose
+review cannot see this; re-checking the encoding does.
+
+Input JSON (`--encoding score_def.json`):
+    {
+      "confidence_levels": [1, 2, 3, 4, 5],
+      "scores": {
+        "positive": {"1": 6, "2": 7, "3": 8, "4": 9, "5": 10},
+        "negative": {"1": 5, "2": 4, "3": 3, "4": 2, "5": 1}
+      }
+    }
+`scores[call][confidence]` is the numeric value fed to the ROC/AUC routine for that cell.
+
+Exit codes: 0 monotonic, 1 a collision/inversion (folded or mis-encoded), 2 usage.
+Stdlib-only (json / argparse / sys). Run `--demo` for the 10-combination unit test.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+
+
+def evidence_order(levels: list) -> list[tuple[str, object]]:
+    """Cells ordered from strongest-against-positive to strongest-for-positive."""
+    asc = sorted(levels, key=lambda x: float(x))
+    # negative call: highest confidence first (strongest against) → lowest
+    neg = [("negative", c) for c in reversed(asc)]
+    # positive call: lowest confidence first → highest (strongest for)
+    pos = [("positive", c) for c in asc]
+    return neg + pos
+
+
+def check_encoding(spec: dict) -> dict:
+    levels = spec["confidence_levels"]
+    scores = spec["scores"]
+    order = evidence_order(levels)
+    seq = []
+    for call, conf in order:
+        # JSON object keys are strings; tolerate int/str confidence keys.
+        cell = scores[call]
+        val = cell.get(str(conf), cell.get(conf))
+        if val is None:
+            return {"ok": False, "problems": [f"missing score for {call}/conf {conf}"],
+                    "sequence": seq}
+        seq.append({"call": call, "confidence": conf, "score": val})
+    problems = []
+    for a, b in zip(seq, seq[1:]):
+        if b["score"] == a["score"]:
+            problems.append(
+                f"collision: {a['call']}/{a['confidence']} and {b['call']}/{b['confidence']} "
+                f"share score {a['score']} (a folded/mirrored encoding maps opposite evidence "
+                f"to the same value)")
+        elif b["score"] < a["score"]:
+            problems.append(
+                f"inversion: {a['call']}/{a['confidence']} (={a['score']}) ranks above "
+                f"{b['call']}/{b['confidence']} (={b['score']}) but carries weaker evidence "
+                f"for the positive label")
+    return {"ok": not problems, "problems": problems,
+            "sequence": [(s["call"], s["confidence"], s["score"]) for s in seq]}
+
+
+def _demo() -> int:
+    """The 10-combination (K=5) unit test: a correct directional encoding passes,
+    the folded encoding fails on collisions."""
+    levels = [1, 2, 3, 4, 5]
+    correct = {"confidence_levels": levels,
+               "scores": {"positive": {str(c): 5 + c for c in levels},
+                          "negative": {str(c): 6 - c for c in levels}}}
+    folded = {"confidence_levels": levels,
+              "scores": {"positive": {str(c): c for c in levels},
+                         "negative": {str(c): 6 - c for c in levels}}}
+    rc = check_encoding(correct)
+    rf = check_encoding(folded)
+    ok = rc["ok"] and not rf["ok"]
+    print(f"correct directional encoding: monotonic={rc['ok']} (expected True)")
+    print(f"folded encoding: monotonic={rf['ok']} (expected False)")
+    if rf["problems"]:
+        print(f" folded problems[0]: {rf['problems'][0]}")
+    print("DEMO PASS" if ok else "DEMO FAIL")
+    return 0 if ok else 1
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser(description="Confidence-weighted rating→AUC monotonicity probe.")
+    ap.add_argument("--encoding", help="JSON score-definition file (see module docstring)")
+    ap.add_argument("--demo", action="store_true", help="run the 10-combination unit test")
+    args = ap.parse_args()
+
+    if args.demo:
+        return _demo()
+    if not args.encoding:
+        sys.stderr.write("ERROR: --encoding FILE or --demo is required\n")
+        return 2
+    try:
+        spec = json.loads(open(args.encoding, encoding="utf-8").read())
+    except (OSError, ValueError) as e:
+        sys.stderr.write(f"ERROR: cannot read encoding: {e}\n")
+        return 2
+
+    res = check_encoding(spec)
+    if res["ok"]:
+        print("OK: the (call × confidence) → score encoding is strictly monotonic.")
+        return 0
+    print("NON-MONOTONIC encoding (folded or mis-ordered) — this mis-estimates the AUC:")
+    for p in res["problems"]:
+        print(f" - {p}")
+    return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
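The docstring's claim that a folded score understates discrimination can be made concrete with a small worked example. The sketch below is illustrative only: the ratings are invented toy data, `auc`, `directional`, and `folded` are names made up for this example, and the AUC is computed with the standard Mann-Whitney pairwise definition rather than anything shipped in the package. The two encodings mirror the `correct` and `folded` specs built in `_demo()`.

```python
from itertools import product


def auc(pos_scores, neg_scores):
    """Mann-Whitney AUC: P(score_pos > score_neg), ties counted as one half."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def directional(call, conf, k=5):
    """Monotonic ladder: negative/high-confidence lowest, positive/high-confidence highest."""
    return k + conf if call == "positive" else (k + 1) - conf


def folded(call, conf, k=5):
    """Folded encoding from the docstring: opposite-evidence cells share scores."""
    return conf if call == "positive" else (k + 1) - conf


# Invented toy ratings as (call, confidence); not data from any study.
diseased = [("positive", 5), ("positive", 4), ("positive", 1), ("negative", 2)]
healthy = [("negative", 5), ("negative", 4), ("negative", 1), ("positive", 2)]

for name, enc in (("directional", directional), ("folded", folded)):
    pos = [enc(call, conf) for call, conf in diseased]
    neg = [enc(call, conf) for call, conf in healthy]
    print(f"{name}: AUC = {auc(pos, neg):.4f}")
```

On this toy data the same eight ratings yield a lower AUC under the folded encoding than under the directional one, because a confident negative call and an unconfident positive call end up on the same part of the scale.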
package/skills/analyze-stats/tests/fixtures/gen_palette.py
ADDED
@@ -0,0 +1,22 @@
+"""
+Analysis: synthetic CLEAN fixture exercising the colorblind-safe palette.
+A hex-color list must NOT be flagged HARDCODED_DATA_LITERAL even alongside a
+data-file read (regression for the WONG-palette false positive).
+Date: 2026-01-01
+Random seed: 42
+"""
+import numpy as np
+import pandas as pd
+
+np.random.seed(42)
+
+# The Wong (2011) colorblind-safe palette that make-figures recommends. Eight
+# string literals; the digits live inside the hex codes, not in tabular data.
+WONG = ["#000000", "#E69F00", "#56B4E9", "#009E73",
+        "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]
+
+# portable relative path; no hand-typed numeric data
+df = pd.read_csv("cohort.csv")
+
+means = df.groupby("group")["auc"].mean()
+print(WONG[0], float(means.iloc[0]))
package/skills/analyze-stats/tests/test_generated_code.sh
CHANGED
@@ -55,5 +55,17 @@ check "exit 0 (clean .py)" test "$?" -eq 0
 python3 "$SCRIPT" --code-dir "$HERE/fixtures" --strict --quiet >/dev/null 2>&1
 check "exit 1 (--code-dir scan)" test "$?" -eq 1
 
+# (5) hex-color palette + data read -> NOT HARDCODED_DATA_LITERAL (WONG-palette
+# false-positive regression); the script is otherwise clean -> exit 0
+PALETTE="$HERE/fixtures/gen_palette.py"
+python3 "$SCRIPT" "$PALETTE" --out "$OUT" --quiet >/dev/null 2>&1
+check "no HARDCODED_DATA_LITERAL on hex-color palette" python3 -c "
+import json
+d=json.load(open('$OUT'))
+assert not any(c['verdict']=='HARDCODED_DATA_LITERAL' for c in d['claims']), 'palette flagged as data literal'
+"
+python3 "$SCRIPT" "$PALETTE" --strict --quiet >/dev/null 2>&1
+check "exit 0 on clean palette script" test "$?" -eq 0
+
 echo "fail=$fail"; [[ "$fail" -eq 0 ]] && echo "ALL PASS" || echo "FAILURES: $fail"
 exit "$fail"
package/skills/design-study/SKILL.md
CHANGED
@@ -195,6 +195,39 @@ For an AI-system-versus-human-expert benchmark specifically, route to `/design-a
 extends this subsection with arm definition, LLM-as-judge versus human-as-judge adjudication, and a
 structured export schema.
 
+**Perceptual / reader AI study — design-stage ceiling gate**
+
+For a reader/observer/perceptual or diagnostic-accuracy AI study (visual Turing test, AI-vs-human
+detection, image-provenance/deepfake, observer study), the acceptance ceiling is fixed **at design
+time, not at analysis time** — excellent execution cannot lift a ceiling baked into the comparator,
+the estimand, or the reader cohort. Walk these six before data lock and, for each, take the
+higher-ambition option or record an explicit, defensible reason not to (set each at the impact level
+of the journal you actually want):
+
+1. **Comparator realism (biggest lever).** A curated teaching-repository "authentic" arm scopes the
+   claim to "teaching-quality", not clinical. Use consecutive, de-identified clinical-acquisition
+   images (the real PACS spectrum), or add a clinical-spectrum validation arm.
+2. **Format / non-content confound matching.** Match every non-content attribute (aspect ratio,
+   resolution, compression, color profile) across arms by construction, and pre-specify a
+   confound-classifier ceiling check (format-only AUC must be ≪ reader AUC) as a *primary* gate.
+3. **Synthetic / index-arm denominator (survivorship).** Pre-specify how failed/low-quality
+   generations are counted; report the full generation denominator rather than evaluating only the
+   convincing survivors.
+4. **Reader independence and breadth.** Recruit an independent, non-author, multi-site (ideally
+   multi-national) reader cohort; collect reader characteristics; blind readers to the hypothesis
+   where feasible.
+5. **Estimand and power (generalize, don't condition).** Power the reader-AND-case generalization as
+   the **primary** estimand from the start, so the two-way interval — not a pool-conditional number —
+   supports the headline claim.
+6. **Novelty positioning vs scoop, and venue-fit.** Scan for close prior work at design time; if a
+   flagship precedent exists, make the differentiation categorical (new modality class, clinical
+   spectrum, outcome linkage), not incremental; pick the venue whose audience values the likely
+   result (a rigorous null fits a methodology-forward journal better than an impact-first one).
+
+The meta-rule: set the comparator, the confound-matching, the reader cohort, and the estimand at the
+target journal's impact level **before** data collection — do not plan to out-write a structural
+ceiling in revision.
+
 ### Phase 3: Clinical framing
 
 Ask whether the comparator and endpoint support the stated claim:
package/skills/peer-review/SKILL.md
CHANGED
@@ -135,11 +135,11 @@ confidential note and the recommendation are consistent.
 
 ### Phase 2A: Systematic Review / Meta-Analysis Extension
 
-Apply this internal-consistency-first gate (P0) plus
+Apply this internal-consistency-first gate (P0) plus 17-probe checklist (P1–P17) **only when manuscript type is "Systematic Review", "Meta-Analysis", or "Systematic Review and Meta-Analysis"**. These probes complement (do not replace) the generic Phase 2 issue checklist.
 
 **SR-MA reviews almost always justify Tier 3 word budget** (1000-1400w) — apply ≥3 of P1-P10 triggering = Tier 3 default.
 
-**Probe detail (P0–
+**Probe detail (P0–P17), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/sr_ma.md`. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route conclusion-threatening or integrity findings into the Confidential Comments to the Editor, and place a confirmed error that drives a headline claim as the Major #1 candidate.
 
 ### Phase 2B: Survival / Prognostic Model Extension
 
@@ -182,14 +182,14 @@ The original-research probes (Phase 2 issue checklist, Phase 2A/2B/2C) do not tr
 
 ### Phase 2E: Observational / Confounding Extension
 
-Apply this
+Apply this 16-probe checklist (O1–O16) **only when the manuscript is an observational study** (cohort, case-control, cross-sectional, health-screening / registry) **whose central claim is an adjusted exposure–outcome association** estimated by covariate adjustment rather than randomization. These probes complement (do not replace) the generic Phase 2 issue checklist and the STROBE reporting items; they target the gap between the stated adjustment set and what the exposure-stratified Table 1 shows.
 
 **Exempt**:
 - Randomized trials (confounding controlled by design → Phase 2 + CONSORT)
 - Purely descriptive / prevalence reports with no adjusted association claim
 - Diagnostic-accuracy studies with no exposure–outcome estimand (→ Phase 2A DTA cells + categories A–C)
 
-**Probe detail (O1–
+**Probe detail (O1–O16), with output templates:** `${CLAUDE_SKILL_DIR}/references/domain-probes/observational_confounding.md`. Load it and apply each probe when the trigger above fires. O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set), O7 (an outcome consequence/mediator wrongly adjusted — the opposite-direction failure, e.g. serum uric acid in an eGFR model), and O8 (records > subjects with the analysis unit undisclosed) are data-checkable and the highest-yield probes — verify O1/O7 against the manuscript's own Table 1 and run the records-vs-subjects check for O8. In this skill, map each probe finding to the review draft as a Major / Minor comment; a confounding-completeness gap (O1), over-adjustment that moves the headline estimate (O7), a selection/collider structure (O3), undisclosed repeat-subject clustering (O8), an undisclosed complete-case collapse (O5), a report-derived outcome with no construct-validity defence (O9), an inferential effect-size gradient across overlapping/nested subsets with no difference/interaction test (O10), an ignored/mis-specified complex-survey design (O11, NHANES/KNHANES weights without strata+PSU, or a subgroup by row-deletion), a data-mined inflection-point/'saturation' cutoff (O12), a cross-sectional mediation claimed as a causal chain without a temporal-order caveat / M–Y-confounding sensitivity (O13), or a synergy/joint-effect claim on the wrong interaction scale — multiplicative-only or joint-category ORs with no additive RERI/AP/S (O14) — is design-level, so surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate rather than softening it to a reporting fix.
 
 ### Phase 2E-2: Clinical Prediction-Model Extension
 
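Probe O14 in the hunk above names the additive interaction measures RERI, AP, and S without defining them. As a hedged illustration, the sketch below computes their point estimates from three relative risks using the standard textbook definitions (RERI = RR11 − RR10 − RR01 + 1, AP = RERI / RR11, S = (RR11 − 1) / ((RR10 − 1) + (RR01 − 1))). The function name `additive_interaction` and the input values are invented for this example; it is not taken from the package, and a real review would also want interval estimates, which this sketch omits.

```python
def additive_interaction(rr10: float, rr01: float, rr11: float) -> dict:
    """Point estimates of additive interaction from three relative risks.

    rr10: exposure A only, rr01: exposure B only, rr11: both,
    each relative to the doubly unexposed group. Odds ratios are often
    substituted as approximations when the outcome is rare.
    """
    reri = rr11 - rr10 - rr01 + 1.0          # relative excess risk due to interaction
    ap = reri / rr11                         # attributable proportion
    denom = (rr10 - 1.0) + (rr01 - 1.0)
    s = (rr11 - 1.0) / denom if denom != 0 else float("nan")  # synergy index
    return {
        "RERI": reri,
        "AP": ap,
        "S": s,
        "multiplicative_ratio": rr11 / (rr10 * rr01),
    }


# Invented illustrative values, not results from any study.
est = additive_interaction(rr10=2.0, rr01=3.0, rr11=7.0)
for name, value in est.items():
    print(f"{name}: {value:.3f}")
```

In this invented example the multiplicative ratio is close to 1 while RERI is clearly above 0, which shows why a synergy claim checked on only one scale can mislead.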
@@ -201,7 +201,7 @@ Apply this 4-probe checklist (CP1–CP4) **only when the manuscript develops or
|
|
|
201
201
|
|
|
202
202
|
Apply when an AI/ML **primary study** (diagnostic, prognostic, triage, detection) makes a clinical claim in the Title/Abstract/Conclusion — generalizable, outperforms clinicians, deployment-ready, can replace a reader. Complements Phase 2F (recommendation calibration) and the signature "Overclaiming vs evidence level" check; co-applies with Phase 2C for radiomics-AI and Phase 2B for prognostic-AI.
|
|
203
203
|
|
|
204
|
-
**Probe detail (AO0–
|
|
204
|
+
**Probe detail (AO0–AO6), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/ai_overclaiming.md`. Load it and apply each probe when the trigger fires. Run AO0 first — locate the load-bearing claim and read it together with its cited evidence before alleging over-reach (a hedged Discussion qualifier is not a headline). In this skill, map each probe finding to the review draft as a Major / Minor comment; a headline generalizability (AO1), superiority/replacement (AO2/AO3), or deployment-readiness (AO4) claim that outruns the design is framing-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate when it is the paper's headline. AO5 catches over-reach in the reported metric itself (best-fold headline without cross-fold CI/SD, unstated/test-tuned operating point, rebalanced-accuracy, or a code-vs-claims mismatch); pair it with the `exemplar_reviews/optimistic_validation_reporting.md` phrasing model and raise it as Major when it carries the headline.
### Phase 2H: RCT / Intervention-Trial Extension
@@ -211,9 +211,9 @@ Apply this 8-probe checklist (RC0–RC7) **only when the manuscript is a randomi
### Phase 2I: Diagnostic-Accuracy / Reader-Study Extension
-
Apply this
+
Apply this 8-probe checklist (D1–D8) **only when the manuscript is a diagnostic test accuracy (DTA) primary study** — an index test against a reference standard — including **multi-reader multi-case (MRMC)** reader studies (AI-vs-reader or modality comparison). These probes complement (do not replace) the generic Phase 2 issue checklist and the STARD / QUADAS-2 items; they target verification/spectrum/blinding bias and the MRMC design/variance issues a reader study adds. (For a DTA **meta-analysis**, use Phase 2A / `sr_ma.md`.)
-
**Probe detail (D1–
+
**Probe detail (D1–D8), with output templates and the leads-vs-findings discipline:** `${CLAUDE_SKILL_DIR}/references/domain-probes/diagnostic_accuracy.md`. Load it and apply each probe when the trigger fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; two-gate (case-control) sampling (D2), verification/incorporation bias (D1), or an MRMC analysis that ignores reader variance (D6) is design/analysis-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate. Pairs the `analyze-stats` `table-types/reader_study.md` table and the `make-figures` `exemplar_plots/mrmc_roc.md` figure; a test-set-tuned operating threshold pairs with `exemplar_reviews/optimistic_validation_reporting.md`.
### Phase 2J: Case-Report Extension
@@ -336,7 +336,7 @@ After drafting, verify mechanically:
- ≥50% of Minor requests use hedged forms ("I'd suggest," "could," "would help") rather than imperative ("must," bare "Please [verb]")
- General Comments names ≥2 specific strengths before listing concerns
- At most 1 typo/grammar Minor Comment, only if in formal section or systematic
-
9. **SR-MA-specific QC** (if Phase 2A applied): Confirm the P0 internal-consistency gate was run before any fabrication claim. For each P1–
+
9. **SR-MA-specific QC** (if Phase 2A applied): Confirm the P0 internal-consistency gate was run before any fabrication claim. For each P1–P17 probe used, verify the corresponding Major comment cites source PMID + source page/table reference + verbatim quote, and that no probe lead was promoted to a finding without source confirmation (leads-vs-findings discipline). Reviews citing extraction errors without source-page reference are not actionable for authors.
10. **Radiomics-reproducibility QC** (if Phase 2C applied): If an acquisition-parameter sweep predicts an outcome from its own grid axes (R1 design-grid circularity) or the substantive result is a cross-domain failure framed as success (R3), confirm the recommendation reflects design-level severity and is not softened to a reporting fix. Where a model × threshold/cohort grid yields a few p < 0.05, confirm the multiplicity / expected-false-positive count is named (R4), not deferred to "statistical review needed."
11. **Review-article QC** (if Phase 2D applied): Confirm RV1–RV9 are reflected — in particular that novelty/value-add (RV1) is raised for a saturated topic and that gap-filling (RV8) is present, not just error-spotting. Verify SANRA is used as an appraisal aid, not over-enforced as a reporting guideline (no PRISMA demand on a narrative review; only RV3 is SANRA-aligned and phrased as a suggestion). Verify every suggested addition uses "consider adding" phrasing (no "must cite"), is source-confirmed, and that preprints are labeled as preprints (not equated with peer-reviewed guidelines). Confirm Phase 2F was run for the recommendation: when RV1 novelty is a Major in a saturated space with no distinct contribution, the recommendation is escalated toward Reject (the contribution IS the product — weak novelty is unfixable-in-current-form), not defaulted to the revision/Reconsider tier.
12. **AI/method/review priority QC**: Before a Major Revision (or Reconsider) recommendation, confirm Phase 2F
@@ -406,7 +406,7 @@ For radiomic feature-reproducibility / phantom parameter-sweep / reliability-fil
For Review / narrative / primer / state-of-the-art manuscripts, apply the Phase 2D 9-probe audit (novelty/value-add, scope/aims, evidence-gathering transparency, technical/medical accuracy, taxonomy/synthesis coherence, balance/currency/citation accuracy, load-bearing figures/tables, constructive gap-filling, curated-base circularity) in place of the original-research probes — error-spotting plus proportionate gap-filling, with SANRA used as an appraisal aid only.
-
For observational studies whose central claim is an adjusted exposure–outcome association, also apply the Phase 2E
+
For observational studies whose central claim is an adjusted exposure–outcome association, also apply the Phase 2E 16-probe audit (confounding completeness, adjustment-set provenance, selection/collider bias, exposure measurement validity, missing-data / complete-case collapse, residual-confounding E-value, over-adjustment, analysis-unit/clustering, outcome construct validity, overlapping-subset gradient, complex-survey design & weighting, data-driven threshold mining, cross-sectional mediation, interaction scale, selection on modality/procedure availability, serial-imaging lesion-tracking), with O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set) and O7 (an outcome consequence/mediator wrongly adjusted) checked against the manuscript's own Table 1.
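For the interaction-scale probe, the additive measures can be computed from the three joint-category estimates. The sketch below uses the standard definitions (RERI = RR11 - RR10 - RR01 + 1, AP = RERI / RR11, S = (RR11 - 1) / ((RR10 - 1) + (RR01 - 1))); the function name and the input values are illustrative assumptions, confidence intervals are omitted, and odds ratios stand in for relative risks only as an approximation.

```python
def additive_interaction(rr10, rr01, rr11):
    """Point estimates of additive interaction from joint-category relative risks.

    rr10: exposure A only, rr01: exposure B only, rr11: both, each vs. neither.
    """
    reri = rr11 - rr10 - rr01 + 1.0
    ap = reri / rr11
    denom = (rr10 - 1.0) + (rr01 - 1.0)
    synergy = (rr11 - 1.0) / denom if denom != 0 else float("nan")
    return {
        "RERI": reri,
        "AP": ap,
        "S": synergy,
        # Reported alongside so both scales are visible in one place.
        "multiplicative_ratio": rr11 / (rr10 * rr01),
    }

# Hypothetical estimates, not taken from any manuscript.
result = additive_interaction(rr10=2.0, rr01=3.0, rr11=7.0)
print(result)
```

With these inputs the multiplicative ratio is close to 1 while RERI is well above 0, which is the pattern a multiplicative-only analysis would miss.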
For cross-modality image-synthesis manuscripts (MRI→PET / MRI→CT / non-contrast→contrast / low-dose→full-dose) that claim functional/molecular information or a substitute for the unavailable target modality, also apply the Phase 2K 4-probe audit (IS1 determinism/information-ceiling vs a source→label baseline, IS2 target-derived-preprocessing/slice-selection leakage, IS3 global vs lesion-level quantitative agreement, IS4 mechanistic/proxy-signal plausibility); IS2 and IS4 are typically unfixable-in-current-form and govern the recommendation per Phase 2F.
@@ -6,7 +6,7 @@
- self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
-
# AI / ML overclaiming probes (AO0–
+
# AI / ML overclaiming probes (AO0–AO6)
A 5-probe checklist (AO1–AO5, with AO0 as a gate) for medical-AI/ML primary studies (diagnostic, prognostic, triage, detection) where the **conclusion's reach exceeds the evidence**. These probes complement (do not replace) the generic Phase 2 issue checklist and the signature "Overclaiming vs evidence level" check. The aim is to keep a framing-level over-reach from passing as a wording nitpick: a paper can report sound metrics yet draw a clinical claim — generalizable, outperforms clinicians, deployment-ready — that the design does not support, and that claim is what a reader carries away. AO1–AO4 target over-reach in the *claim sentences*; AO5 targets over-reach baked into the *reported metric itself* (an optimistically- or unreproducibly-reported number that makes the result look stronger than a faithful estimate). Run AO0 first.
@@ -43,6 +43,12 @@ A 5-probe checklist (AO1–AO5, with AO0 as a gate) for medical-AI/ML primary st
- (d) **Code-vs-claims fidelity**: where code is released, does the described tuning/metric match it? Common mismatches: a claimed hyperparameter search the code does not run; a metric (e.g., specificity) attributed to a library function that does not compute it. A confirmed mismatch is an integrity/reproducibility flag — verify against the released code before asserting it.
- Severity: MAJOR when the load-bearing performance claim rests on a best-fold number, an unstated/test-tuned threshold, a rebalanced-accuracy headline, or a code-vs-claims mismatch (the reported result is optimistic or not reproducible); MINOR when cross-validation was sound and only the cross-fold summary, the operating point, or a class-aware metric is missing from the write-up.
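The best-fold versus cross-fold distinction in AO5 can be made concrete with a small summary. This is a sketch under stated assumptions: the per-fold AUC values are invented for illustration, and `summarise_folds` is a hypothetical helper, not part of the skill's scripts.

```python
from statistics import mean, stdev

def summarise_folds(fold_scores):
    """Return cross-fold mean, sample SD, best fold, and the best-minus-mean gap."""
    if len(fold_scores) < 2:
        raise ValueError("need at least two folds to estimate spread")
    avg = mean(fold_scores)
    best = max(fold_scores)
    return {
        "mean": avg,
        "sd": stdev(fold_scores),
        "best": best,
        # How much a best-fold headline overstates the cross-fold average.
        "optimism_gap": best - avg,
    }

fold_aucs = [0.81, 0.78, 0.90, 0.76, 0.80]  # illustrative values only
fold_summary = summarise_folds(fold_aucs)
print(fold_summary)
```

A headline that quotes only `best` without `mean` and `sd` is the reporting pattern AO5 asks the reviewer to flag.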
+
**AO6 — Arm-defining task vs deployment workflow (construct validity of the evaluation)**:
+
- Distinct from AO3 (model-task ≠ human-task *framing*) and from scope-coherence (claim ≠ result): AO6 asks whether the **task that defines the study arms mirrors the deployment workflow the claim targets**, or an artificial handicap/selection. Two recurrent failure modes:
+
- (a) **Handicapped arm** — the AI (or comparator) arm is operationalized in a way the real workflow never imposes: e.g., AI read in pure blind interpretation while the actual deployment provides clinical context / priors / the report, so the evaluation measures a task no one performs.
+
- (b) **Success-conditioned selection** — the arm or the analyzed subset is gated on an AI-success condition (cases where the model produced an output, segmentations that "passed", studies the pipeline did not fail on), so the comparison is conditioned on the very thing under test.
+
- This is a **design/paradigm-level** defect: the operationalized task, not the prose, is mis-specified, so it cannot be fixed by rewording the claim — escalate **past an ordinary Major** (editors read it as a Reject-grade construct-validity failure; a panel that files it as a fixable Major under-rates it). The fix is a re-designed arm whose task matches the intended deployment workflow and an unconditioned (consecutive / intention-to-diagnose) analysis set.
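The effect of success-conditioned selection (AO6 b) can be illustrated by computing sensitivity on both analysis sets. The sketch assumes a hypothetical case record with `diseased` and `ai_output` fields, and it counts a case with no AI output as a miss in the unconditioned set, which is one conservative convention rather than the only defensible one.

```python
def sensitivity_by_analysis_set(cases):
    """Compare sensitivity on AI-success cases vs. all diseased cases."""
    diseased = [c for c in cases if c["diseased"]]
    with_output = [c for c in diseased if c["ai_output"] is not None]
    if not diseased or not with_output:
        raise ValueError("need diseased cases, some with an AI output")
    true_pos = sum(1 for c in diseased if c["ai_output"] == "pos")
    return {
        # Denominator restricted to cases where the pipeline produced an output.
        "conditioned": true_pos / len(with_output),
        # Denominator is every diseased case; no-output cases count as misses.
        "intention_to_diagnose": true_pos / len(diseased),
        "n_dropped": len(diseased) - len(with_output),
    }

# Hypothetical cohort: 6 detected, 2 missed, 2 where the pipeline failed.
cases = (
    [{"diseased": True, "ai_output": "pos"}] * 6
    + [{"diseased": True, "ai_output": "neg"}] * 2
    + [{"diseased": True, "ai_output": None}] * 2
)
sens = sensitivity_by_analysis_set(cases)
print(sens)
```

The gap between the two figures is what a reader loses when the manuscript reports only the conditioned subset.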
+
## Decision-impact / early-deployment probes (DECIDE-AI axis, DI1–DI5)
Co-apply when a study claims **clinical utility, deployment, or decision impact** of an AI system, or *is* an early-stage live clinical evaluation. The reporting axis is then DECIDE-AI (early-stage clinical evaluation of AI decision-support); these probes check that a utility/deployment claim rests on real-use evidence, not retrospective accuracy. They sharpen AO4 for the deployment-evaluation case.
@@ -6,7 +6,7 @@
- self-review: Anticipated Major / Minor Comments (Fatal / Fixable) mapped to category letters.
Do NOT edit one copy only — run `python3 scripts/check_domain_probe_sync.py --sync`. -->
-
# Diagnostic-accuracy / reader-study probes (D1–
+
# Diagnostic-accuracy / reader-study probes (D1–D8)
A checklist for **diagnostic test accuracy (DTA) primary studies** — an index test against a reference standard, including **multi-reader multi-case (MRMC)** reader studies (e.g., AI-vs-reader or modality-comparison). These probes complement (do not replace) the generic Phase 2 issue checklist and the STARD / QUADAS-2 items; they target the biases QUADAS-2 names and the MRMC design/variance issues a reader study adds. Pairs the `analyze-stats` `table-standards/table-types/reader_study.md` table and the `make-figures` `exemplar_plots/mrmc_roc.md` figure. (For a DTA **meta-analysis**, use `sr_ma.md`.)
@@ -38,9 +38,14 @@ A checklist for **diagnostic test accuracy (DTA) primary studies** — an index
- Typical signatures: "patients were included if [index score] ≥ T" in a paper whose aim is to evaluate that index; enrolling on a positive screening test to then "validate" the screening test; defining the diseased group by the same reader/algorithm output under study.
- This is **design-level, not a reporting fix** — escalate past an ordinary Major (a co-reviewer / editor reads it as a fatal selection/spectrum artifact). The fix is a reference standard and an enrollment criterion that are independent of the index test (a consecutive suspected-disease series), not a re-analysis.
+
**D8 — Exclusion flow-diagram ↔ Methods-prose consistency + modality-safety enumeration**:
+
- Cross-check the **exclusion criteria drawn in the participant flow diagram** (STROBE/STARD flow) against the **exclusion list in the Methods prose**. A criterion that appears in one but not the other (a count in the flow with no prose rationale, or a prose exclusion not reflected in the flow boxes) is a reporting inconsistency a co-reviewer catches by reading the two side by side.
+
- For an **imaging** study, check that **modality-specific safety contraindications** and **device/artifact exclusions** are enumerated where applicable: MR safety (pacemaker/implant, claustrophobia), iodinated/gadolinium **contrast** contraindications (renal function, allergy, pregnancy), and **image-quality/artifact** exclusions (motion, metal artifact, incomplete coverage). Silent omission of these categories in a prospective imaging cohort understates the selected spectrum.
+
- Severity: a flow-vs-prose exclusion mismatch is MAJOR when it changes the analytic-N or the eligible spectrum; missing modality-safety/artifact exclusion categories is MINOR–MAJOR depending on how much of the source population they remove. The fix is a reconciled exclusion list (flow == prose) plus an explicit modality-safety/artifact exclusion enumeration.
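The D8 flow-versus-prose cross-check amounts to a set comparison plus an arithmetic check on the analytic N. The sketch below is illustrative only: `reconcile_exclusions`, the criterion labels, and the counts are assumptions, and in practice the labels would first need to be normalised by hand so that the same criterion is worded identically in both lists.

```python
def reconcile_exclusions(flow_counts, prose_criteria, n_assessed, n_analysed):
    """Compare flow-diagram exclusions with the Methods-prose exclusion list."""
    flow_labels = set(flow_counts)
    prose_labels = set(prose_criteria)
    return {
        # Counted in the flow diagram but never justified in the prose.
        "only_in_flow": sorted(flow_labels - prose_labels),
        # Stated in the prose but missing from the flow boxes.
        "only_in_prose": sorted(prose_labels - flow_labels),
        # Assessed minus all flow exclusions should equal the analysed N.
        "arithmetic_ok": n_assessed - sum(flow_counts.values()) == n_analysed,
    }

# Hypothetical example values.
flow = {"motion artifact": 12, "incomplete coverage": 5, "withdrew consent": 3}
prose = ["motion artifact", "incomplete coverage", "pacemaker/implant"]
report = reconcile_exclusions(flow, prose, n_assessed=200, n_analysed=180)
print(report)
```

Here the counts reconcile, yet one criterion appears only in the flow and one only in the prose, which is the inconsistency D8 asks the reviewer to raise.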
+
**Output template (D2 / D6 example)**:
> "The study uses a case-control (two-gate) design — confirmed cases versus healthy controls — rather than a consecutive series of patients in whom the diagnosis was suspected. This typically overestimates accuracy and does not reflect the intended-use spectrum, so I'd read the reported sensitivity/specificity as proof-of-concept rather than clinical accuracy, and suggest tempering the Abstract accordingly. Separately, the reader study reports a single reader-averaged AUC; because readers are a sample, I'd suggest an MRMC analysis (e.g., Obuchowski–Rockette) that accounts for both reader and case variance, with per-reader estimates shown and the unit of analysis (per-patient vs per-lesion) stated."
-
**Discipline — leads vs findings (applies to D1–
+
**Discipline — leads vs findings (applies to D1–D8)**:
- A verification/blinding/spectrum concern from a quick scan is a **lead until Methods and the participant flow are read together** — distinguish under-reporting (ask to clarify) from a true design bias (MAJOR).
- Anchor each comment to the exact bias (partial vs differential verification; single- vs two-gate; reader-averaged vs fixed-reader; per-patient vs per-lesion) and the location. Keep severity tied to what the flaw does: two-gate sampling, incorporation bias, or ignoring reader variance is design/analysis-level (MAJOR, often Major #1); an unreported reading-order detail is a clarify-request.