npm - medsci-skills - Versions diffs - 4.1.0 - Mend

medsci-skills 4.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (702)

package/skills/deidentify/references/korean_phi_patterns.md ADDED

@@ -0,0 +1,135 @@
+# Korean PHI Patterns
+> Locale: Korean. This reference intentionally contains Korean — it documents the Korean PHI-detection patterns (the `kr` locale feature of `/deidentify`). See `docs/locale_inventory.md`.
+Regex patterns and column-name dictionaries used by `deidentify.py`.
+This file serves as documentation; the actual patterns are in the Python script.
+## Value-Level Regex Patterns
+### 주민등록번호 (Resident Registration Number)
+```
+\d{6}-[1-4]\d{6}
+```
+- Format: `YYMMDD-GNNNNNN`
+- First 6 digits: birthdate (YYMMDD)
+- 7th digit (G): gender + birth century (1=1900s male, 2=1900s female, 3=2000s male, 4=2000s female)
+- Remaining 6: region code + serial + check digit
+- Example: `850315-1234567`
+### 외국인등록번호 (Alien Registration Number)
+```
+\d{6}-[5-8]\d{6}
+```
+- Same format as the RRN (주민번호), but the 7th digit is 5-8
+- One regex covers both if the RRN character class `[1-4]` is widened to `[1-8]`
+### 전화번호 — 휴대전화 (Mobile Phone)
+```
+01[016789]-?\d{3,4}-?\d{4}
+```
+- Prefixes: 010, 011, 016, 017, 018, 019
+- Dashes optional
+- Examples: `010-1234-5678`, `01012345678`, `011-234-5678`
+### 전화번호 — 유선전화 (Landline)
+```
+0[2-6][0-9]{0,2}-?\d{3,4}-?\d{4}
+```
+- Area codes: 02 (Seoul), 031-033 (Gyeonggi/Incheon/Gangwon), 041-044 (Chungcheong), 051-055 (Gyeongsang), 061-064 (Jeolla/Jeju)
+- Examples: `02-555-1234`, `031-765-4321`, `051-234-5678`
+### 이메일 (Email)
+```
+[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
+```
+- Standard email pattern
+- Common Korean domains: naver.com, daum.net, hanmail.net, kakao.com
+### 날짜 (Date)
+ISO-style format (year-month-day, separated by `-`, `/`, or `.`):
+```
+(19|20)\d{2}[-/.](0[1-9]|1[0-2])[-/.](0[1-9]|[12]\d|3[01])
+```
+Korean format:
+```
+(19|20)\d{2}년\s*(0?[1-9]|1[0-2])월\s*(0?[1-9]|[12]\d|3[01])일
+```
+Short (YYMMDD):
+```
+([5-9]\d|0[0-4])(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])
+```
+### 주소 (Address)
+Korean address suffix pattern:
+```
+(특별시|광역시|특별자치시|특별자치도|도\s|시\s|군\s|구\s|읍\s|면\s|동\s|리\s|로\s|길\s)
+```
+- Matches administrative divisions that appear in Korean addresses
+- Detection threshold: >30% of values in a column contain these patterns
+### 한국인 이름 (Korean Name)
+```
+[\uAC00-\uD7AF]{2,4}
+```
+- 2 to 4 Hangul syllable characters
+- **CRITICAL**: Only applied to columns identified as name-type by column name
+- Common Korean words (정상, 양성, 음성, etc.) also match this pattern
+- False positive mitigation: match against column name hints only
+## Column Name Dictionary
+### Name-type columns
+`환자명`, `성명`, `이름`, `성함`, `patient_name`, `patientname`, `pt_name`, `name`,
+`first_name`, `last_name`
+### RRN-type columns
+`주민번호`, `주민등록번호`, `ssn`, `social_security`
+### Date-type columns
+`생년월일`, `생년`, `출생일`, `dob`, `date_of_birth`, `birth_date`, `birthdate`
+### Phone-type columns
+`전화번호`, `연락처`, `핸드폰`, `휴대폰`, `휴대전화`, `자택전화`,
+`phone`, `telephone`, `mobile`, `phone_number`, `cell`
+### Address-type columns
+`주소`, `자택주소`, `거주지`, `address`, `home_address`, `street`,
+`zip`, `zipcode`, `zip_code`
+### Email-type columns
+`이메일`, `email`, `email_address`
+### ID-type columns
+`차트번호`, `등록번호`, `환자번호`, `의무기록번호`, `원무번호`,
+`mrn`, `medical_record`, `chart_no`, `patient_id`, `patientid`,
+`chart_number`, `record_number`, `hospital_id`
+### Insurance-type columns
+`보험번호`, `insurance_no`, `insurance_number`
+## High-Cardinality Numeric Detection
+Columns that match none of the column-name dictionaries above but contain:
+- >90% pure numeric values (digits only)
+- Length >= 5 digits
+- >80% unique values
+These are flagged as potential MRN/chart numbers for researcher review.
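
A minimal sketch, not taken from `deidentify.py`, of how two of the value-level patterns and the high-cardinality heuristic documented in `korean_phi_patterns.md` could be applied. The regexes are copied from the reference, with the RRN class widened to `[1-8]` as it suggests. The function names and the decision to require all three thresholds together are assumptions made for illustration.

```python
import re

# Patterns copied from korean_phi_patterns.md (RRN class widened to [1-8]
# so that alien registration numbers are covered too).
RRN_OR_ARN = re.compile(r"\d{6}-[1-8]\d{6}")
MOBILE = re.compile(r"01[016789]-?\d{3,4}-?\d{4}")
# Assumption: "pure numeric, length >= 5" means five or more ASCII digits.
DIGITS_5_PLUS = re.compile(r"[0-9]{5,}")


def find_phi(text):
    """Return (label, matched_text) pairs found in a single cell value."""
    hits = [("rrn_or_arn", m.group()) for m in RRN_OR_ARN.finditer(text)]
    hits += [("mobile", m.group()) for m in MOBILE.finditer(text)]
    return hits


def is_high_cardinality_numeric(values):
    """Apply the documented thresholds to one column's string values.

    Assumption: a column is flagged only when more than 90% of its non-empty
    cells are 5+ digit numbers AND more than 80% of them are unique.
    """
    cells = [v.strip() for v in values if v and v.strip()]
    if not cells:
        return False
    numeric_share = sum(1 for v in cells if DIGITS_5_PLUS.fullmatch(v)) / len(cells)
    unique_share = len(set(cells)) / len(cells)
    return numeric_share > 0.90 and unique_share > 0.80
```

Run against the synthetic fixtures, `find_phi` would pick up the phone number embedded in the free-text `비고` column of `test_edge_cases.csv`, and `is_high_cardinality_numeric` would flag the `차트번호` column of `test_phi_korean.csv` while leaving the `측정값` column alone.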

package/skills/deidentify/skill.yml ADDED

@@ -0,0 +1,43 @@
+schema_version: 2
+name: deidentify
+layer: A
+owner_domain: data_preparation
+when_to_use: "De-identify clinical CSV/TSV/Excel data locally before any LLM-assisted analysis."
+when_NOT_to_use: "General data cleaning or type fixing (use clean-data). Never to have the agent itself read, paste, or process raw PHI."
+inputs:
+  - path: "raw clinical data file (CSV / TSV / XLSX)"
+    schema: csv
+    required: true
+outputs:
+  - path: "de-identified data file"
+  - path: "scan report (column classifications, no raw values)"
+  - path: "audit log (SHA-256 hashes only)"
+deterministic_scripts:
+  - deidentify.py
+side_effects:
+  - writes_deidentified_data
+  - runs_locally_no_network
+downstream_consumers:
+  - clean-data
+  - analyze-stats
+forbidden_actions:
+  - read_raw_phi
+  - display_mapping_file
+  - send_phi_to_llm
+# v2.1 quality card
+purpose: "Detect and remove PHI from clinical tabular data with a local-only CLI, so downstream LLM analysis never touches raw identifiers."
+safety_boundaries:
+  - "The agent never sees raw PHI; de-identification runs in a standalone local script with no network or AI calls."
+  - "Never reads or displays the re-identification mapping file (it holds original PHI values)."
+  - "Only the scan report (no raw values), the hash-only audit log, and the de-identified output may be read."
+known_limitations:
+  - "Regex and heuristic detection across 10 country locale packs is not a substitute for expert disclosure review or an IRB determination."
+  - "PHI coverage is limited to the bundled locale packs (kr, us, jp, cn, de, uk, fr, ca, au, in)."
+validation_commands:
+  - "python deidentify.py scan <file> --locale <code>   # review column classifications before stripping"
+  - "inspect the SHA-256 audit log after a full run"
+evidence_surface: bundled_script

package/skills/deidentify/tests/README.md ADDED

@@ -0,0 +1,26 @@
+# Deidentify Test Fixtures
+All CSV files in this directory contain **synthetic test data only**.
+No real patient or person is represented.
+- **Names** (`김철수`, `이영희`, `박민수`, etc.) are common Korean placeholder
+  names equivalent to "John Doe" / "Jane Doe" in English. They were chosen
+  precisely because they are generic enough to be unattributable to any
+  real individual.
+- **RRN (주민번호)** values follow the public format specification but the
+  digits are arbitrary and do not validate against the official checksum
+  algorithm used by the Korean civil registry.
+- **Phone numbers**, **addresses**, **emails**, **chart numbers**, and
+  **diagnoses** are all constructed for the purpose of exercising the
+  PHI detector regexes shipped with `/deidentify`.
+These fixtures exist to verify that the de-identifier:
+1. Detects the PHI patterns the skill claims to detect.
+2. Leaves non-PHI fields (clinical measurements, dates of routine
+   nature) untouched.
+3. Handles edge cases (mixed date formats, half-width vs full-width
+   digits, comma vs newline separators, missing fields).
+If you need to add a new fixture, follow the same rule: every value must
+be either a published format example or a constructed synthetic string.
+Never copy real EMR data into this directory, even for one-off debugging.

package/skills/deidentify/tests/test_clean.csv ADDED

@@ -0,0 +1,16 @@
+sample_id,measurement_a,measurement_b,category,score,group,result
+S001,45.2,12.3,A,78,control,positive
+S002,38.7,15.6,B,82,treatment,negative
+S003,52.1,9.8,A,91,control,positive
+S004,41.5,11.2,C,67,treatment,negative
+S005,36.9,14.7,B,73,control,positive
+S006,48.3,10.5,A,85,treatment,positive
+S007,39.6,13.1,C,69,control,negative
+S008,55.0,8.9,B,94,treatment,positive
+S009,42.8,12.0,A,76,control,negative
+S010,37.4,16.2,C,88,treatment,positive
+S011,50.5,11.8,A,71,control,negative
+S012,44.1,13.5,B,83,treatment,positive
+S013,46.7,10.1,C,79,control,positive
+S014,40.2,14.0,A,86,treatment,negative
+S015,53.8,9.3,B,92,control,positive

package/skills/deidentify/tests/test_edge_cases.csv ADDED

@@ -0,0 +1,11 @@
+col_a,col_b,비고,숫자값,빈컬럼,긴한글단어
+값1,123,,45.6,,대한민국
+,456,일반 메모,78.9,,
+값3,,환자 김철수 연락처 010-1234-5678,12.3,,서울특별시강남구테헤란로
+값4,789,"콤마, 포함 값",0,,
+,,,,,
+값6,012,정상 소견,99.9,,가나다라마바사아자차카타파하
+값7,345,2024년 3월 15일 검사,55.5,,
+값8,678,,33.3,,AB
+값9,901,"따옴표 ""포함"" 값",77.7,,정상
+값10,234,특이사항 없음,44.4,,대한외과학회

package/skills/deidentify/tests/test_phi_korean.csv ADDED

@@ -0,0 +1,11 @@
+환자명,주민번호,전화번호,생년월일,주소,이메일,차트번호,진단,측정값,비고
+김철수,850315-1234567,010-1234-5678,1985-03-15,서울특별시 강남구 테헤란로 123,chulsu@example.com,12345678,위암,45.2,정기검진
+이영희,901220-2345678,010-9876-5432,1990-12-20,부산광역시 해운대구 우동 456,yhlee@test.kr,23456789,유방암,38.7,
+박민수,780505-1567890,02-555-1234,1978-05-05,대구광역시 수성구 범어동 789-12,minsoo.park@hospital.org,34567890,폐결절,12.3,재검 필요
+정미영,880101-2678901,010-5555-9999,1988년 1월 1일,인천광역시 남동구 구월동 321,miyoung@naver.com,45678901,정상,67.8,
+최준호,950730-1789012,051-234-5678,1995.07.30,광주광역시 서구 화정동 654,junho.choi@gmail.com,56789012,고혈압,120.5,약물 복용 중
+홍길동,700101-1890123,010-1111-2222,1970-01-01,경기도 성남시 분당구 판교로 100,gildong@daum.net,67890123,당뇨,200.3,합병증 주의
+강수현,920415-2901234,010-3333-4444,1992.04.15,세종특별자치시 아름동 555,suhyun.kang@outlook.com,78901234,갑상선결절,2.1,
+윤대한,830620-1012345,031-765-4321,1983-06-20,충청남도 천안시 동남구 신방동 888,daehan@company.co.kr,89012345,정상,55.0,건강
+임소연,960808-2123456,010-7777-8888,1996년 8월 8일,전라북도 전주시 완산구 효자동 222,soyeon.lim@yonsei.ac.kr,90123456,빈혈,8.5,추적검사
+한지민,750225-2234567,042-333-9876,1975-02-25,제주특별자치도 제주시 연동 777,jimin.han@korea.ac.kr,01234567,대장용종,15.0,내시경 추적

package/skills/design-ai-benchmarking/SKILL.md ADDED

@@ -0,0 +1,214 @@
+---
+name: design-ai-benchmarking
+description: >
+  Design and validity review for studies that benchmark one or more AI systems against a human-expert
+  panel as the reference. Covers the evaluation question and arm definition, decoupled multi-dimensional
+  rubrics with anchors, planted calibration probes, reviewer-panel construction, inter-rater reliability
+  targets, LLM-as-judge versus human-as-judge adjudication, construct-independence guards, and a
+  structured rating-export schema. Use before data collection on an AI-vs-expert evaluation.
+triggers: AI benchmarking, AI vs human expert, reader study design, expert panel evaluation, LLM-as-judge, AI evaluation rubric, model benchmark design, human baseline comparison, AI-output rating, evaluation rubric design
+tools: Read, Write, Edit, Bash, Grep, Glob
+model: inherit
+---
+# Design-AI-Benchmarking Skill
+## Purpose
+This skill pressure-tests an AI-vs-human-expert benchmark **before any ratings are collected**, so that
+the comparison is fair, the rubric measures distinct constructs, the scale is calibrated, and the
+reported reliability is interpretable. It is the AI-evaluation specialization of `/design-study`: where
+`/design-study` reviews a study in general, this skill owns the specific machinery of comparing AI
+system(s) to a panel of human experts (or to each other) on rated outputs.
+Use it when:
+- one or more AI systems will be scored against a human-expert reference (reader study, annotation
+  panel, AI-output evaluation, model-vs-model bench)
+- a rubric and rating protocol must be locked before reviewers begin
+- a benchmark feels vulnerable to "the highest score is just the most tautological item" or
+  "low agreement, but we cannot tell why" criticism
+- a reviewer or editor asks how the evaluation controlled for rater drift, leakage, or judge bias
+Do **not** use it for: general study/validity review (use `/design-study`); statistical execution such
+as ICC or DeLong (use `/analyze-stats`); reporting-guideline item audits (use `/check-reporting`);
+or reviewing an already-written manuscript (use `/peer-review` or `/self-review`).
+---
+## Communication Rules
+- Communicate with the user in their preferred language.
+- Use English for statistical, machine-learning, and reporting-guideline terminology.
+- Be direct about evaluation-validity risks, but always propose the smallest feasible fix first.
+- Never invent reviewer ratings, reference labels, or agreement statistics; those come from collected
+  data only.
+---
+## Standard Output
+```text
+## AI-Benchmark Design Review
+Evaluation question: ...
+Arms / systems compared: ...
+Reference (human-expert panel): ...
+Unit of rating: (item / case / output)
+### Rubric (decoupled dimensions)
+- dimension -> construct -> anchors (1..k)
+### Calibration probes (blinded, randomized)
+- positive-control / known-bad / instability / mechanism-contradiction
+### Reviewer panel
+- n reviewers, metadata captured, per-reviewer randomized order
+### Reliability plan
+- overall IRR target + control-item IRR (reported separately)
+### Judge strategy
+- human-as-judge / LLM-as-judge / both + adjudication rule
+### Validity risks
+1. ...
+### Minimal fixes
+- ...
+### Decision
+- Ready to collect / Needs rubric revision / Needs arm or judge redesign
+```
+---
+## Workflow
+### Phase 1: Define the evaluation question and arms
+Pin down, in writing:
+- the exact claim the benchmark must support (e.g., "system A's outputs are perceptually
+  indistinguishable from expert outputs", not "system A is deployment-ready")
+- every arm/system being compared, and what each arm receives as input (same items, same information
+  access, same output format) so no arm has a hidden advantage
+- the human-expert reference: who they are, and whether they set ground truth, provide a comparison
+  arm, or both
+- the unit of rating (item, case, output) and how many units each reviewer sees
+**Gate:** Present the reconstructed evaluation question, arms, and reference to the user and confirm
+before designing the rubric. A wrong reconstruction misdirects the entire benchmark.
+### Phase 2: Design a decoupled multi-dimensional rubric
+- **Decouple the axes.** Each rated dimension measures one construct. Keep "is the output valid/correct"
+  separate from "is it novel", "is it feasible/measurable", "does it add value over current tools", and
+  "would it change action". A candidate can be high-validity yet low-added-value ("real but redundant");
+  a single blended score hides this divergence.
+- **Anchor every scale point** with a short verbal descriptor; pilot the anchors with at least one
+  reviewer before locking.
+- **Pre-specify discriminant validity**: hypothesize which dimensions should correlate vs be orthogonal,
+  then report the full inter-dimension correlation matrix to confirm the rubric measures distinct
+  constructs.
+- A worked rubric template lives in `${CLAUDE_SKILL_DIR}/references/elicitation_rubric_template.md`.
+### Phase 3: Insert and randomize calibration probes
+Plant a small number of deliberate control items, blinded and randomized across raters (record who
+received which via a `probe_arm` flag), to (i) anchor the scale, (ii) measure rater drift/fatigue, and
+(iii) audit the rubric and pipeline itself. Four useful flavors:
+- **Positive control / "too-good" item** — a known-strong or near-tautological item; tests whether
+  raters equate "largest effect" with "best", and whether the construct-independence gate (Phase 7) works.
+- **Known-bad negative control** — an engineered defect (fabricated reference, missing key statistic);
+  expected to score low.
+- **Instability item** — an estimate that reverses or fails to replicate on a holdout; tests
+  caveat-handling.
+- **Mechanism-contradiction item** — an empirical direction that opposes the proposed mechanism.
+Probes are *planted or adjudicated*, never fabricated to fit a hypothesis.
+### Phase 4: Construct the reviewer panel
+- Recruit reviewers spanning the intended expertise gradient; pre-specify any expertise stratification.
+- Capture reviewer metadata (years of experience, prior AI-evaluation experience, subspecialty) for
+  descriptive reporting and stratified analysis.
+- Randomize item order **per reviewer** (not one global seed) and record the order; plan to analyze
+  order and fatigue effects.
+- Require each item to be judged standalone; discourage cross-item references in free-text, which signal
+  non-independent rating.
+**Gate:** Present the panel composition, stratification, and randomization plan for user review before
+recruitment is finalized.
+### Phase 5: Set inter-rater reliability targets
+- Pre-specify the agreement statistic (e.g., ICC for continuous ratings, weighted kappa for ordinal)
+  and a target with justification.
+- **Report reliability on the planted control items separately** as primary evidence of rubric and
+  scale validity. A low overall ICC is interpretable only if raters at least converge on the controls;
+  surfacing both numbers prevents "low agreement => bad rubric" or "bad raters" misreads.
+- Plan the minimum ratings-per-item needed for a stable agreement estimate (delegate the math to
+  `/analyze-stats`).
+### Phase 6: Choose the judge strategy and adjudication
+- Decide human-as-judge, LLM-as-judge, or both. If an LLM is used as a judge, treat it as one more arm
+  whose ratings must themselves be validated against the human panel on the control items.
+- Pre-specify the **adjudication rule** for disagreement (e.g., majority, a third senior reviewer,
+  consensus discussion) and who adjudicates.
+- Blind judges to arm identity wherever feasible; record any unavoidable unblinding.
+### Phase 7: Construct-independence and leakage guards
+- Exclude any predictor or input that is a definitional component of the outcome (mathematical
+  definition), and flag near-tautological composites built from the outcome's defining components — they
+  produce an inflated, near-circular result and belong as labeled probes, not discoveries.
+- Verify no arm sees post-decision or outcome-derived information the others do not.
+- Confirm the reference labels were not derived from the same model output being evaluated.
+### Phase 8: Lock a structured export schema
+Define the machine-readable rating record up front: per-item ratings across every rubric dimension,
+free-text justifications, follow-up flags, the `probe_arm` flag, reviewer id and metadata, item order,
+and timing. A synthetic schema lives in `${CLAUDE_SKILL_DIR}/references/benchmark_export_schema.json`.
+**Gate:** Present the final rubric, probe set, panel plan, judge strategy, and export schema together;
+collect explicit user approval before any rating begins. Locking these before data collection is the
+whole point — changes afterward compromise the comparison.
+---
+## Handoff Rules
+- route to `/analyze-stats` for ICC / weighted kappa / DeLong, agreement sample size, and translating
+  effect sizes from the benchmark results into real-world terms
+- route to `/check-reporting` for STARD-AI, CLAIM, or TRIPOD+AI item-level reporting once the design is locked
+- route to `/design-study` when the broader study around the benchmark (cohort logic, analysis unit,
+  comparator) also needs review
+- route to `/peer-review` or `/self-review` only after ratings exist and a manuscript is being assessed
+---
+## What This Skill Does NOT Do
+- It does not compute agreement statistics or run analyses directly (that is `/analyze-stats`).
+- It does not collect or fabricate ratings, reference labels, or probe outcomes.
+- It does not draft manuscript prose or run a reporting-guideline audit.
+- It does not replace a full peer review of a finished manuscript.
+## Anti-Hallucination
+- **Never fabricate references.** All citations must be verified via `/search-lit` with a confirmed DOI
+  or PMID. Mark unverified references as `[UNVERIFIED - NEEDS MANUAL CHECK]`.
+- **Never invent reviewer ratings, agreement statistics, reference labels, or probe outcomes** — these
+  come from collected data only. A reported ICC, kappa, or score with no underlying rating record is the
+  failure mode this skill exists to prevent.
+- **Never invent clinical definitions, diagnostic criteria, or guideline recommendations.** If uncertain,
+  flag with `[VERIFY]` and ask the user.
+- If a reporting-guideline item, journal policy, or evaluation standard is uncertain, state the
+  uncertainty rather than guessing.
+## Reference Files
+- `${CLAUDE_SKILL_DIR}/references/elicitation_rubric_template.md` -- a synthetic, decoupled
+  multi-dimension rating rubric with anchors and a planted-probe column.
+- `${CLAUDE_SKILL_DIR}/references/benchmark_export_schema.json` -- a synthetic JSON schema for the
+  per-item rating export (ratings, justifications, probe_arm, reviewer metadata, order, timing).
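
A minimal sketch of the per-reviewer randomization described in Phases 3 and 4 of `SKILL.md`: planted probes are mixed in with the real items, each reviewer gets an independently shuffled order, and the order is recorded. The function name, the seed scheme, and the tuple layout are assumptions for illustration; the skill itself ships no such script.

```python
import random


def build_review_orders(item_ids, probe_ids, reviewer_ids, base_seed=0):
    """Return {reviewer_id: [(order_index, item_id, is_probe), ...]}.

    Hypothetical helper: every reviewer sees all items plus all probes,
    in an order that is shuffled separately for that reviewer.
    """
    orders = {}
    for reviewer_id in reviewer_ids:
        # One reproducible generator per reviewer, not one global seed.
        rng = random.Random(f"{base_seed}:{reviewer_id}")
        pool = [(item, False) for item in item_ids]
        pool += [(probe, True) for probe in probe_ids]
        rng.shuffle(pool)
        orders[reviewer_id] = [
            (index, item, is_probe) for index, (item, is_probe) in enumerate(pool)
        ]
    return orders
```

The recorded `order_index` is what a later order or fatigue analysis would need. Reviewers never see the `is_probe` flag, which stays in the study records only.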

package/skills/design-ai-benchmarking/references/benchmark_export_schema.json ADDED

@@ -0,0 +1,69 @@
+{
+  "$schema": "http://json-schema.org/draft-07/schema#",
+  "title": "AI-vs-expert benchmark rating export (synthetic template)",
+  "description": "One record per (reviewer, item) rating. Lock this schema before collecting ratings. All example values are illustrative placeholders, not real data.",
+  "type": "object",
+  "required": ["reviewer_id", "item_id", "arm", "probe_arm", "order_index", "ratings", "justification"],
+  "properties": {
+    "reviewer_id": {
+      "type": "string",
+      "description": "Stable pseudonymous id for the reviewer (no personal identifiers).",
+      "examples": ["R01"]
+    },
+    "reviewer_metadata": {
+      "type": "object",
+      "description": "Captured once per reviewer; repeated here for convenience.",
+      "properties": {
+        "years_experience": {"type": "integer", "minimum": 0},
+        "prior_ai_evaluation": {"type": "boolean"},
+        "subspecialty": {"type": "string"}
+      }
+    },
+    "item_id": {"type": "string", "examples": ["item_0042"]},
+    "arm": {
+      "type": "string",
+      "description": "Which system produced the rated output, or the human-expert reference arm.",
+      "examples": ["system_a", "system_b", "expert_reference"]
+    },
+    "probe_arm": {
+      "type": ["string", "null"],
+      "description": "Non-null when this is a planted control item.",
+      "enum": ["pos_control", "neg_control", "instability", "mechanism_contra", null]
+    },
+    "order_index": {
+      "type": "integer",
+      "minimum": 0,
+      "description": "Position in this reviewer's randomized item order (for fatigue analysis)."
+    },
+    "ratings": {
+      "type": "object",
+      "description": "One score per decoupled rubric dimension.",
+      "required": ["validity", "novelty", "feasibility", "added_value", "actionability"],
+      "properties": {
+        "validity": {"type": "integer", "minimum": 1, "maximum": 5},
+        "novelty": {"type": "integer", "minimum": 1, "maximum": 5},
+        "feasibility": {"type": "integer", "minimum": 1, "maximum": 5},
+        "added_value": {"type": "integer", "minimum": 1, "maximum": 5},
+        "actionability": {"type": "integer", "minimum": 1, "maximum": 5}
+      }
+    },
+    "justification": {
+      "type": "string",
+      "description": "Free text, judged standalone; no cross-item references."
+    },
+    "follow_up": {
+      "type": ["string", "null"],
+      "description": "What additional evidence would change the rating."
+    },
+    "judge_type": {
+      "type": "string",
+      "enum": ["human", "llm"],
+      "description": "Whether the rater is a human expert or an LLM-as-judge arm."
+    },
+    "timing_seconds": {
+      "type": ["number", "null"],
+      "minimum": 0,
+      "description": "Time spent on this item, for fatigue/drift analysis."
+    }
+  }
+}
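
A minimal standard-library sketch of checking one rating record against the constraints in `benchmark_export_schema.json`. It covers only the required fields, the `probe_arm` enum, `order_index`, and the 1 to 5 rating ranges; it is not a full JSON Schema validator, and the function names are hypothetical.

```python
REQUIRED = ("reviewer_id", "item_id", "arm", "probe_arm",
            "order_index", "ratings", "justification")
DIMENSIONS = ("validity", "novelty", "feasibility", "added_value", "actionability")
PROBE_ARMS = ("pos_control", "neg_control", "instability", "mechanism_contra", None)


def _is_int(value):
    # JSON Schema "integer" excludes booleans, but bool is a subclass of int in Python.
    return isinstance(value, int) and not isinstance(value, bool)


def check_record(record):
    """Return a list of problems; an empty list means these checks passed."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in record]
    if record.get("probe_arm") not in PROBE_ARMS:
        problems.append("probe_arm is not one of the allowed values")
    if "order_index" in record:
        order = record["order_index"]
        if not (_is_int(order) and order >= 0):
            problems.append("order_index must be a non-negative integer")
    ratings = record.get("ratings")
    if isinstance(ratings, dict):
        for dim in DIMENSIONS:
            score = ratings.get(dim)
            if not (_is_int(score) and 1 <= score <= 5):
                problems.append(f"ratings.{dim} must be an integer from 1 to 5")
    return problems
```

A check like this could run as each record is written, so that a malformed export is caught during collection and not at analysis time.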

package/skills/design-ai-benchmarking/references/elicitation_rubric_template.md ADDED

@@ -0,0 +1,37 @@
+# Decoupled Elicitation Rubric Template (synthetic)
+A starting rubric for an AI-vs-human-expert benchmark. Every dimension measures one construct, so a
+candidate can score high on validity yet low on added value ("real but redundant"). All values below
+are illustrative placeholders, not real data — replace them for your own evaluation.
+## Per-item rating dimensions
+| Dimension | Construct (one only) | Anchor 1 (low) | Anchor 3 (mid) | Anchor 5 (high) |
+|-----------|----------------------|----------------|----------------|-----------------|
+| Validity | Is the output correct against the reference? | Contradicted by reference | Partially supported | Fully supported |
+| Novelty | Is it new vs prior work? | Restates known result | Incremental extension | Genuinely new |
+| Feasibility | Can it be measured/obtained in practice? | Not measurable | Measurable with effort | Routinely measurable |
+| Added value | Does it add over a measure already in use? | Redundant with a routine measure | Marginal gain | Clear gain over current tools |
+| Actionability | Would a clinician act on it for an individual? | Would not change action | Might change action | Would change action |
+Notes:
+- Pilot the anchors with at least one reviewer before locking the scale.
+- Pre-specify which dimensions are expected to correlate (e.g., validity and actionability) vs be
+  orthogonal (e.g., novelty and feasibility); report the inter-dimension correlation matrix afterward.
+## Planted calibration probes
+`probe_arm` marks a control item; it is randomized across reviewers and excluded from the primary
+estimate but reported separately for scale validity.
+| probe_arm | Flavor | What it tests | Expected behavior |
+|-----------|--------|---------------|-------------------|
+| pos_control | Positive / "too-good" (near-tautological) | Whether raters equate "largest effect" with "best"; whether the construct-independence gate fires | High validity, low added value |
+| neg_control | Known-bad (engineered defect) | Whether obvious defects are caught | Low validity |
+| instability | Reverses on holdout | Caveat-handling for unstable estimates | Lower confidence ratings |
+| mechanism_contra | Direction opposes mechanism | Whether raters notice mechanism conflict | Flagged in free-text |
+## Per-item free-text fields
+- justification (required): one or two sentences, judged standalone (no cross-item references)
+- follow_up (optional): what additional evidence would change the rating
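
A minimal sketch of the inter-dimension correlation matrix that the rubric notes ask to be reported. It assumes records shaped like the export schema (a `ratings` object per record) and uses Pearson correlation purely for illustration; the template does not say which coefficient to use, and a rank-based one may suit ordinal 1 to 5 scores better. Both function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when a dimension never varies
    return sxy / sqrt(sxx * syy)


def correlation_matrix(records, dimensions):
    """Return {dim_a: {dim_b: r}} over every pair of rubric dimensions."""
    columns = {d: [rec["ratings"][d] for rec in records] for d in dimensions}
    return {
        a: {b: pearson(columns[a], columns[b]) for b in dimensions}
        for a in dimensions
    }
```

Comparing the resulting matrix with the pre-specified expectations (which pairs should correlate, which should be orthogonal) is the discriminant-validity check itself; the actual analysis is delegated to `/analyze-stats`.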

package/skills/design-ai-benchmarking/skill.yml ADDED

@@ -0,0 +1,38 @@
+schema_version: 2
+name: design-ai-benchmarking
+layer: D
+owner_domain: study_design
+when_to_use: "Design or review a study that benchmarks one or more AI systems against a human-expert panel: arm definition, decoupled rubric, calibration probes, reviewer panel, IRR targets, judge adjudication, and a structured export schema, before ratings are collected."
+when_NOT_to_use: "General study/validity review (use design-study); statistical execution such as ICC/DeLong (use analyze-stats); reporting-guideline item audits (use check-reporting); reviewing a finished manuscript (use peer-review or self-review)."
+inputs:
+  - "evaluation question and the AI system(s) / arms to compare"
+  - "candidate outputs or items to be rated"
+  - "available human-expert reviewers and their metadata"
+outputs:
+  - "benchmark design and validity review (decision notes)"
+  - "decoupled rating rubric with anchors and planted calibration probes"
+  - "structured rating-export schema (JSON)"
+side_effects:
+  - writes_decision_notes
+downstream_consumers:
+  - analyze-stats
+  - check-reporting
+  - design-study
+forbidden_actions:
+  - fabricate_reviewer_ratings_or_reference_labels
+  - approve_benchmark_with_unblinded_or_leaky_rubric
+  - report_irr_without_separate_control_item_reliability
+# v2.1 quality card
+purpose: "Surface design and validity risks specific to AI-vs-human-expert benchmarks (rubric coupling, scale calibration, rater independence, judge choice) before data collection begins."
+safety_boundaries:
+  - "Advisory only: writes decision notes plus rubric/schema artifacts, never ratings or analysis results."
+  - "Calibration probes and reference labels are planted or adjudicated, never fabricated to fit a hypothesis."
+known_limitations:
+  - "A design review reduces but cannot eliminate evaluation bias; it does not guarantee a valid benchmark."
+  - "No standalone demo; rubric and panel decisions require domain-expert judgement."
+validation_commands:
+  - "carry the rubric and export schema into analyze-stats for ICC/agreement, then check-reporting (STARD-AI / CLAIM / TRIPOD+AI)"
+evidence_surface: manual_workflow