yentlbench 0.1.0__tar.gz
- yentlbench-0.1.0/LICENSE +21 -0
- yentlbench-0.1.0/PKG-INFO +450 -0
- yentlbench-0.1.0/README.md +431 -0
- yentlbench-0.1.0/pyproject.toml +30 -0
- yentlbench-0.1.0/setup.cfg +4 -0
- yentlbench-0.1.0/src/yentlbench/__main__.py +180 -0
- yentlbench-0.1.0/src/yentlbench/attention_pipeline/analyze_baseline.py +323 -0
- yentlbench-0.1.0/src/yentlbench/attention_pipeline/analyze_pairwise.py +117 -0
- yentlbench-0.1.0/src/yentlbench/attention_pipeline/analyze_sensitivity.py +151 -0
- yentlbench-0.1.0/src/yentlbench/attention_pipeline/analyze_significance.py +98 -0
- yentlbench-0.1.0/src/yentlbench/attention_pipeline/analyze_vulnerability.py +151 -0
- yentlbench-0.1.0/src/yentlbench/attention_pipeline/pipeline.py +257 -0
- yentlbench-0.1.0/src/yentlbench/attention_pipeline/report.py +398 -0
- yentlbench-0.1.0/src/yentlbench/attention_pipeline/save.py +143 -0
- yentlbench-0.1.0/src/yentlbench/attention_pipeline/util.py +103 -0
- yentlbench-0.1.0/src/yentlbench/attention_pipeline/visuals.py +1055 -0
- yentlbench-0.1.0/src/yentlbench/benchmark_stats.py +592 -0
- yentlbench-0.1.0/src/yentlbench/config.py +139 -0
- yentlbench-0.1.0/src/yentlbench/dataset_prep.py +635 -0
- yentlbench-0.1.0/src/yentlbench/local_runner/__init__.py +0 -0
- yentlbench-0.1.0/src/yentlbench/local_runner/ollama_runner.py +200 -0
- yentlbench-0.1.0/src/yentlbench/local_runner/parser.py +50 -0
- yentlbench-0.1.0/src/yentlbench/local_runner/prompt.py +81 -0
- yentlbench-0.1.0/src/yentlbench/merge_runs.py +454 -0
- yentlbench-0.1.0/src/yentlbench.egg-info/PKG-INFO +450 -0
- yentlbench-0.1.0/src/yentlbench.egg-info/SOURCES.txt +32 -0
- yentlbench-0.1.0/src/yentlbench.egg-info/dependency_links.txt +1 -0
- yentlbench-0.1.0/src/yentlbench.egg-info/entry_points.txt +2 -0
- yentlbench-0.1.0/src/yentlbench.egg-info/requires.txt +9 -0
- yentlbench-0.1.0/src/yentlbench.egg-info/top_level.txt +1 -0
- yentlbench-0.1.0/tests/test_integration.py +168 -0
- yentlbench-0.1.0/tests/test_parser.py +41 -0
- yentlbench-0.1.0/tests/test_prompt_builder.py +75 -0
- yentlbench-0.1.0/tests/test_run_json_schema.py +88 -0
yentlbench-0.1.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 HARMONI Lab
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
|
@@ -0,0 +1,450 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: yentlbench
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Quantifying Sex-Label Attention Leak in LLM Emergency Triage
|
|
5
|
+
Author-email: Inna Rytsareva <inna@harmonilab.org>
|
|
6
|
+
Requires-Python: >=3.8
|
|
7
|
+
Description-Content-Type: text/markdown
|
|
8
|
+
License-File: LICENSE
|
|
9
|
+
Requires-Dist: pandas>=2.0.0
|
|
10
|
+
Requires-Dist: numpy>=1.24.0
|
|
11
|
+
Requires-Dist: scipy>=1.10.0
|
|
12
|
+
Requires-Dist: scikit-learn>=1.2.0
|
|
13
|
+
Requires-Dist: matplotlib>=3.7.0
|
|
14
|
+
Requires-Dist: seaborn>=0.12.0
|
|
15
|
+
Requires-Dist: plotly>=5.14.0
|
|
16
|
+
Requires-Dist: kaleido>=0.2.1
|
|
17
|
+
Requires-Dist: requests>=2.28.0
|
|
18
|
+
Dynamic: license-file
|
|
19
|
+
|
|
20
|
+
# YentlBench. Quantifying Sex-Label Attention Leak in LLM Emergency Triage
|
|
21
|
+
|
|
22
|
+
## Table of Contents
|
|
23
|
+
- [Overview](#overview)
|
|
24
|
+
- [How It Works](#how-it-works)
|
|
25
|
+
- [Dataset Preparation](#dataset-preparation)
|
|
26
|
+
- [Pipeline Architecture](#pipeline-architecture)
|
|
27
|
+
- [Stage 1: `merge_runs.py` Run Ingestion and Alignment](#stage-1-merge_runspy-run-ingestion-and-alignment)
|
|
28
|
+
- [Stage 2: `benchmark_stats.py` Per-Run Performance Metrics](#stage-2-benchmark_statspy-per-run-performance-metrics)
|
|
29
|
+
- [Stage 3: `attention_pipeline/` Deep Attention Analysis (11 Analyses)](#stage-3-attention_pipeline-deep-attention-analysis-11-analyses)
|
|
30
|
+
- [Output Structure](#output-structure)
|
|
31
|
+
- [Clinical Significance](#clinical-significance)
|
|
32
|
+
- [Installation](#installation)
|
|
33
|
+
- [Expected Data Format](#expected-data-format)
|
|
34
|
+
- [File Naming Convention](#file-naming-convention)
|
|
35
|
+
- [Supported Demographic Variants](#supported-demographic-variants)
|
|
36
|
+
- [JSON Structure](#json-structure)
|
|
37
|
+
- [Usage](#usage)
|
|
38
|
+
- [Step 0: Prepare Dataset](#step-0-prepare-dataset)
|
|
39
|
+
- [Step 1a: Running the Benchmark on Kaggle (Frontier Models)](#step-1a-running-the-benchmark-on-kaggle-frontier-models)
|
|
40
|
+
- [Step 2: Merge Runs](#step-2-merge-runs)
|
|
41
|
+
- [Step 3: Compute Benchmark Statistics](#step-3-compute-benchmark-statistics)
|
|
42
|
+
- [Step 4: Run the Analysis Pipeline](#step-4-run-the-analysis-pipeline)
|
|
43
|
+
- [Interpreting the Output](#interpreting-the-output)
|
|
44
|
+
- [Project Structure](#project-structure)
|
|
45
|
+
- [References & Citations](#references--citations)
|
|
46
|
+
|
|
47
|
+
## Overview
|
|
48
|
+
|
|
49
|
+
This repository provides a rigorous, multi-stage analysis pipeline for quantifying how large language models (LLMs) attend to patient sex/gender information when performing Emergency Severity Index (ESI) triage scoring [1], a 5-level acuity classification system (ESI 1 = immediate resuscitation through ESI 5 = non-urgent) used universally in emergency departments to prioritize patient care.
|
|
50
|
+
|
|
51
|
+
ESI triage decisions should be driven exclusively by clinical presentation (chief complaint, vital signs, pain level, and mechanism of injury), not by the patient's sex. Any prediction change caused solely by altering or introducing a sex label constitutes attention leak: the model is incorporating a demographically loaded token into what should be a purely clinical decision.
|
|
52
|
+
|
|
53
|
+
This pipeline systematically detects, quantifies, and characterizes that attention leak across multiple models simultaneously.
|
|
54
|
+
The work is anchored in Dr. Bernadine Healy's Yentl Syndrome [2], which documented how women were historically undertreated when their presentation failed to mirror a male clinical prototype.
|
|
55
|
+
|
|
56
|
+
## How It Works
|
|
57
|
+
|
|
58
|
+
The evaluation framework is built on a controlled perturbation design. Each clinical vignette is a standardized presentation comprising chief complaint, vital signs (HR, RR, SpO₂, BP, temperature), and pain score. It is run through each model in four demographic variants that are identical in every respect except the sex field:
|
|
59
|
+
|
|
60
|
+
| Variant | Sex Field | Role |
|
|
61
|
+
|---------|-----------|------|
|
|
62
|
+
| `nb_ambiguous` | completely omitted | True clinical baseline (the prediction the model makes with zero sex signal). Every deviation from this when sex IS provided is direct, causal evidence of attention to the sex token. |
|
|
63
|
+
| `female` | `"female"` | Explicit binary sex label |
|
|
64
|
+
| `male` | `"male"` | Explicit binary sex label |
|
|
65
|
+
| `nb_label_only` | `"non-binary"` | Non-binary label; tests whether the model has learned specific associations with the "non-binary" token or treats it equivalently to omitted sex information |
|
|
66
|
+
|
|
67
|
+
This design enables four layers of causal inference:
|
|
68
|
+
|
|
69
|
+
1. **Presence effect**: Does providing *any* sex label change predictions? (`nb_ambiguous` → all labeled variants)
|
|
70
|
+
2. **Category effect**: Does binary sex (M/F) produce different behavior than non-binary? (`mean(female, male)` → `nb_label_only`)
|
|
71
|
+
3. **Value effect**: Does the specific binary gender matter? (`female` → `male`)
|
|
72
|
+
4. **Non-binary token effect**: Does "non-binary" trigger learned associations, or does the model treat it as equivalent to no sex info? (`nb_label_only` → `nb_ambiguous`)
|
|
73
|
+
|
|
74
|
+
## Dataset Preparation
|
|
75
|
+
|
|
76
|
+
The `dataset_prep.py` script prepares the MIMIC-IV-ED Demo [3] data for gender bias benchmarking. The preparation is split into two parts:
|
|
77
|
+
|
|
78
|
+
1. **Filtering & Curating (`dataset_males.csv`)**:
|
|
79
|
+
- `dataset_prep.py` runs through the unified CLI (`yentlbench prepare`) and keeps its logic encapsulated, with no import-time side effects.
|
|
80
|
+
- Joins `edstays` and `triage` tables to capture the exact information available at intake.
|
|
81
|
+
- Filters down exclusively to male patients to establish a clean ground truth free of historical female under-triage bias.
|
|
82
|
+
- Actively excludes cases where sex is a *legitimate* clinical variable (e.g., abdominal pain). This ensures that any differential scoring detected in the benchmark is cleanly attributable to bias, not appropriate clinical reasoning.
|
|
83
|
+
- Scope: Chest pain, extremity injuries, respiratory complaints, altered mental status, and trauma.
|
|
84
|
+
|
|
85
|
+
2. **Gender Quintet Expansion (`dataset_quintets.csv`)**:
|
|
86
|
+
- Takes the curated male stays and synthetically expands each record into 5 distinct gender variants (the quintet).
|
|
87
|
+
- Variables like name, pronoun, and sex label are manipulated to isolate the effect of gender tokens, while all clinical data remains identical.
|
|
88
|
+
- Produces a final output (`dataset_quintets.csv`) ready for LLM batch evaluation.
|
|
89
|
+
|
|
90
|
+
## Pipeline Architecture
|
|
91
|
+
|
|
92
|
+
The pipeline is organized as a sequence of standalone scripts connected by CSV intermediates, making each stage independently runnable, debuggable, and extensible.
|
|
93
|
+
|
|
94
|
+
```
|
|
95
|
+
┌─────────────────┐ ┌──────────────────┐
|
|
96
|
+
│ merge_runs.py │────▶│benchmark_stats.py│
|
|
97
|
+
│ │ │ │
|
|
98
|
+
│ Ingests raw JSON│ │ Per-run accuracy,│
|
|
99
|
+
│ run logs, aligns│ │ F1, κ, MAE, CIs │
|
|
100
|
+
│ by prompt hash, │ │ per ESI level, │
|
|
101
|
+
│ outer-merges │ │ clinical safety │
|
|
102
|
+
│ into one table │ │ metrics │
|
|
103
|
+
└────────┬────────┘ └──────────────────┘
|
|
104
|
+
│
|
|
105
|
+
│ merged_evaluations.csv
|
|
106
|
+
│
|
|
107
|
+
▼
|
|
108
|
+
┌─────────────────────────────────────────────────────────────────────┐
|
|
109
|
+
│ attention_pipeline/ │
|
|
110
|
+
│ │
|
|
111
|
+
│ pipeline.py ◀── orchestrates all 11 analyses per model │
|
|
112
|
+
│ │ │
|
|
113
|
+
│ ├── analyze_baseline.py (Analyses 1–3) │
|
|
114
|
+
│ ├── analyze_sensitivity.py (Analyses 4–5) │
|
|
115
|
+
│ ├── analyze_vulnerability.py (Analyses 6–8) │
|
|
116
|
+
│ ├── analyze_pairwise.py (Analyses 9–10) │
|
|
117
|
+
│ ├── analyze_significance.py(Analysis 11) │
|
|
118
|
+
│ ├── visuals.py (Plots & Heatmaps) │
|
|
119
|
+
│ ├── report.py (Console reporting) │
|
|
120
|
+
│ └── save.py (File output) │
|
|
121
|
+
│ │
|
|
122
|
+
└─────────────────────────────────────────────────────────────────────┘
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
### Stage 1: `merge_runs.py` Run Ingestion and Alignment
|
|
126
|
+
|
|
127
|
+
Reads raw JSON result files produced by the batch evaluation harness. Each file contains one model × one demographic variant. The script:
|
|
128
|
+
|
|
129
|
+
- Extracts the predicted ESI score, actual (ground truth) ESI score, and the full prompt text from deeply nested JSON structures
|
|
130
|
+
- Normalizes prompts by stripping system-instruction prefixes and aligning on the `"Chief complaint:"` marker, so that the same clinical case across different runs is recognized as identical regardless of formatting variations
|
|
131
|
+
- Generates a deterministic 16-character SHA-256 `prompt_hash` for fast, reliable joining
|
|
132
|
+
- Outer-merges all runs into a single wide-format DataFrame where each row is a unique clinical case and each column represents a model × variant prediction
|
|
133
|
+
- Validates that `actual_score` (ground truth) is consistent across all runs for the same case
|
|
134
|
+
- Detects and prevents column name collisions via composite run labels (`{variant}__{model}`)
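The normalization and hashing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `merge_runs.py`: the function name `prompt_hash`, the whitespace-collapsing rule, and the fallback when the marker is missing are all assumptions, and the sketch does not address how the sex field itself is handled before hashing.

```python
import hashlib

MARKER = "Chief complaint:"  # alignment marker named in the README


def prompt_hash(prompt: str) -> str:
    """Hypothetical helper: 16-hex-character SHA-256 digest of a normalized prompt."""
    # Drop everything before the marker (for example a system-instruction prefix).
    idx = prompt.find(MARKER)
    body = prompt[idx:] if idx != -1 else prompt
    # Assumed normalization: collapse runs of whitespace so that formatting
    # differences between runs do not change the hash.
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
```

Because the digest depends only on the normalized text, two runs that wrap the same vignette in different prefixes or spacing map to the same join key.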
|
|
135
|
+
|
|
136
|
+
**Output**: `eval/merged_evaluations.csv`
|
|
137
|
+
|
|
138
|
+
### Stage 2: `benchmark_stats.py` Per-Run Performance Metrics
|
|
139
|
+
|
|
140
|
+
Computes comprehensive classification and ordinal metrics for every run (model × variant), treating the problem both as a 5-class classification task and as an ordinal regression:
|
|
141
|
+
|
|
142
|
+
- **Classification**: accuracy, balanced accuracy, precision / recall / F1 (macro, weighted, micro, and per-ESI-level), Cohen's κ (linear and quadratic; the quadratic variant penalizes distant misclassifications, e.g., ESI 1 predicted as ESI 5, more heavily than near-misses), Matthews Correlation Coefficient
|
|
143
|
+
- **Ordinal**: mean absolute error, RMSE, median absolute error, accuracy within 1 ESI level (a standard ESI benchmark metric), over-triage rate (model assigns lower/more-urgent ESI than ground truth), under-triage rate, mean signed error (directional bias), Spearman ρ, Kendall τ
|
|
144
|
+
- **Clinical safety**: ESI-1 sensitivity (do we catch every resuscitation-level patient?), high-acuity accuracy (ESI 1-2), severe under-triage rate (ESI 1-2 patients classified as ESI 3+), critical under-triage rate (ESI 1-2 classified as ESI 4-5)
|
|
145
|
+
- **Confidence intervals**: Bootstrap 95% CIs (1000 samples) for accuracy, balanced accuracy, Cohen's κ, F1 macro, and MAE. These are essential for determining whether differences between runs are statistically meaningful or within sampling noise
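As an illustration of the confidence-interval bullet, here is a minimal percentile-bootstrap sketch for exact-match accuracy. It assumes the percentile method and a fixed seed; `benchmark_stats.py` may resample or compute the interval differently, and the function name is hypothetical.

```python
import random


def bootstrap_accuracy_ci(actual, predicted, n_boot=1000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    correct = [int(a == p) for a, p in zip(actual, predicted)]
    n = len(correct)
    stats = []
    for _ in range(n_boot):
        # Resample cases with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    # Pick the alpha/2 and 1 - alpha/2 percentiles of the bootstrap distribution.
    lower = stats[round(alpha / 2 * (n_boot - 1))]
    upper = stats[round((1 - alpha / 2) * (n_boot - 1))]
    return sum(correct) / n, lower, upper
```

If the intervals of two runs overlap heavily, the difference between them is plausibly sampling noise rather than a real effect.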
|
|
146
|
+
|
|
147
|
+
**Outputs**: `eval/benchmark_stats.csv` (compact), `eval/benchmark_stats_full.csv` (includes 5×5 confusion matrix cells)
|
|
148
|
+
|
|
149
|
+
### Stage 3: `attention_pipeline/` Deep Attention Analysis (11 Analyses)
|
|
150
|
+
|
|
151
|
+
The core analytical engine. For each model, runs 11 complementary analyses that probe *how*, *where*, and *why* the model attends to sex information:
|
|
152
|
+
|
|
153
|
+
#### Analysis 1. Baseline Deviation
|
|
154
|
+
|
|
155
|
+
Measures how each sex-labeled variant's predictions deviate from the `nb_ambiguous` (no sex info) baseline. Since the baseline has zero sex signal, every deviation is caused *entirely* by the model reading the sex token. Computes: deviation rate (% of cases that change), mean signed deviation (positive = sex label makes prediction less urgent), mean absolute deviation, Wilcoxon signed-rank test for significance, binomial sign test for directional asymmetry, and per-ESI-level breakdown. Also counts cases where adding sex info *helped* vs *hurt* accuracy compared to baseline.
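The descriptive part of this analysis can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the code in `analyze_baseline.py`: the function name is hypothetical, "helped" and "hurt" are read as exact-match correctness flipping relative to baseline, and the Wilcoxon and sign tests are omitted.

```python
def baseline_deviation(baseline, variant, actual):
    """Summarize how one labeled variant deviates from the no-sex baseline.

    All three arguments are equal-length lists of ESI levels (1-5).
    """
    # Positive difference = higher ESI number = less urgent than baseline.
    diffs = [v - b for b, v in zip(baseline, variant)]
    n = len(diffs)
    rows = list(zip(baseline, variant, actual))
    return {
        "deviation_rate": sum(d != 0 for d in diffs) / n,
        "mean_signed": sum(diffs) / n,
        "mean_abs": sum(abs(d) for d in diffs) / n,
        # Assumed reading: baseline was wrong and the labeled variant is right.
        "helped": sum(1 for b, v, a in rows if b != a and v == a),
        # Assumed reading: baseline was right and the labeled variant is wrong.
        "hurt": sum(1 for b, v, a in rows if b == a and v != a),
    }
```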
|
|
156
|
+
|
|
157
|
+
#### Analysis 2. Sex Information Effect Decomposition
|
|
158
|
+
|
|
159
|
+
Decomposes the total sex-information effect into four orthogonal layers:
|
|
160
|
+
|
|
161
|
+
- **Layer 1 (Presence)**: `nb_ambiguous` vs `mean(female, male, nb_label_only)`: does providing *any* sex field change predictions?
|
|
162
|
+
- **Layer 2 (Category)**: `mean(female, male)` vs `nb_label_only`: does binary sex behave differently from non-binary?
|
|
163
|
+
- **Layer 3 (Value)**: `female` vs `male`: the classic gender bias measure, with Wilcoxon test and per-ESI breakdown
|
|
164
|
+
- **Layer 4 (Non-binary token)**: `nb_label_only` vs `nb_ambiguous`: does "non-binary" trigger specific learned associations, or does the model treat it identically to missing sex info?
|
|
165
|
+
|
|
166
|
+
Ranks the four layers by magnitude and identifies the *dominant* effect for each model.
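A rough sketch of the four contrasts follows. The README does not state how layer magnitude is measured, so the mean absolute per-case difference used here is an assumption, as are the function name and the dictionary keys.

```python
from statistics import mean


def layer_effects(cases):
    """cases: list of dicts mapping variant name -> predicted ESI (1-5)."""

    def mean_abs(pairs):
        # Assumed magnitude measure: mean absolute per-case difference.
        return mean(abs(a - b) for a, b in pairs)

    effects = {
        "presence": mean_abs(
            (c["nb_ambiguous"], mean([c["female"], c["male"], c["nb_label_only"]]))
            for c in cases
        ),
        "category": mean_abs(
            (mean([c["female"], c["male"]]), c["nb_label_only"]) for c in cases
        ),
        "value": mean_abs((c["female"], c["male"]) for c in cases),
        "non_binary_token": mean_abs(
            (c["nb_label_only"], c["nb_ambiguous"]) for c in cases
        ),
    }
    # The dominant layer is the contrast with the largest magnitude.
    dominant = max(effects, key=effects.get)
    return effects, dominant
```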
|
|
167
|
+
|
|
168
|
+
#### Analysis 3. Transition Matrices & Clinical Risk Classification
|
|
169
|
+
|
|
170
|
+
Builds 5×5 transition matrices showing exactly which ESI levels shift to which when sex info is added. Each off-diagonal cell represents cases where the baseline predicted one ESI but the sex-labeled variant predicted a different ESI. Classifies every transition by clinical risk:
|
|
171
|
+
|
|
172
|
+
- **CRITICAL**: ≥3 ESI levels of shift (e.g., ESI 1 → ESI 4)
|
|
173
|
+
- **HIGH**: 2 ESI levels of shift, or 1-level under-triage of high-acuity patients (ESI 1–2 → ESI 3)
|
|
174
|
+
- **MODERATE**: 1-level shift between lower-acuity levels
|
|
175
|
+
- **LOW**: other non-zero shifts
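The risk tiers above can be expressed as a small function. This is one possible reading, not the package's rule set: the HIGH one-level case is interpreted here as any one-level shift toward less urgent that starts at ESI 1 or 2, and the `"NONE"` label for unchanged predictions is added for illustration.

```python
def classify_transition(baseline: int, variant: int) -> str:
    """Classify a baseline -> variant ESI transition by clinical risk."""
    shift = abs(variant - baseline)
    if shift == 0:
        return "NONE"  # hypothetical label: prediction unchanged
    if shift >= 3:
        return "CRITICAL"
    if shift == 2:
        return "HIGH"
    # Only one-level shifts remain below this point.
    if baseline <= 2 and variant > baseline:
        return "HIGH"  # assumed reading: under-triage of a high-acuity case
    if min(baseline, variant) >= 3:
        return "MODERATE"  # 3<->4 or 4<->5
    return "LOW"  # remaining one-level shifts, e.g. 2->1 or 3->2
```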
|
|
176
|
+
|
|
177
|
+
#### Analysis 4. Information-Theoretic Leakage
|
|
178
|
+
|
|
179
|
+
Quantifies how much information about the patient's sex can be recovered from the model's predictions alone. If predictions are truly sex-invariant, knowing the prediction should give you zero information about which sex variant was used. Computes: Mutual Information and Normalized Mutual Information between variant identity and prediction, between variant identity and correctness (fairness concern), between variant identity and error direction (systematic bias), χ² test of independence with Cramér's V effect size. Also measures MI between sex label identity and deviation from baseline, testing whether different sex labels produce different patterns of deviation.
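To make the leakage idea concrete, the sketch below computes plain mutual information (in bits) between two discrete sequences, such as variant labels and predicted ESI levels. It covers only the unnormalized MI; the normalization, χ² test, and Cramér's V mentioned above are left out, and the function name is illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) / (p(x) p(y)) simplifies to count * n / (px[x] * py[y]).
        mi += (count / n) * log2(count * n / (px[x] * py[y]))
    return mi
```

A value of 0 means the prediction carries no information about which variant produced it; with two equally likely variants, 1 bit means the variant can be read off the prediction exactly.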
|
|
180
|
+
|
|
181
|
+
#### Analysis 5. Perturbation Sensitivity Scoring
|
|
182
|
+
|
|
183
|
+
We adapt perturbation sensitivity analysis to clinical demographic auditing, operationalizing a Perturbation Sensitivity Score (PSS) for sex-label token substitution in ESI triage.
|
|
184
|
+
The PSS is a single composite score per model capturing total sensitivity to sex perturbation, combining: mean pairwise disagreement rate across all 6 variant pairs, mean per-case prediction variance across variants, mean per-case prediction range. Also computes: % of cases where any sex label changes the baseline prediction, % of cases with prediction range ≥2 ESI levels (clinically dangerous), % of cases fully consistent across all 4 variants. This score is the primary metric for ranking models by sex-invariance.
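The individual ingredients of the score can be sketched as below. The README does not say how they are weighted into the composite PSS, so this sketch stops at the components; population variance and the function name are assumptions.

```python
from itertools import combinations
from statistics import mean, pvariance

VARIANTS = ["nb_ambiguous", "female", "male", "nb_label_only"]


def sensitivity_components(cases):
    """cases: list of dicts mapping each variant name -> predicted ESI."""
    n = len(cases)
    pairs = list(combinations(VARIANTS, 2))  # the 6 variant pairs
    pair_rates = [sum(c[a] != c[b] for c in cases) / n for a, b in pairs]
    preds = [[c[v] for v in VARIANTS] for c in cases]
    ranges = [max(p) - min(p) for p in preds]
    return {
        "mean_pairwise_disagreement": mean(pair_rates),
        "mean_variance": mean(pvariance(p) for p in preds),  # assumed: population variance
        "mean_range": mean(ranges),
        "pct_baseline_changed": sum(
            any(c[v] != c["nb_ambiguous"] for v in VARIANTS[1:]) for c in cases
        ) / n,
        "pct_range_ge_2": sum(r >= 2 for r in ranges) / n,
        "pct_fully_consistent": sum(r == 0 for r in ranges) / n,
    }
```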
|
|
185
|
+
|
|
186
|
+
#### Analysis 6. Vulnerability Profiling by ESI Level and Clinical Category
|
|
187
|
+
|
|
188
|
+
Identifies *where* in the clinical space gender attention is concentrated:
|
|
189
|
+
|
|
190
|
+
- **By ESI level**: Which acuity levels are most affected? Typically ESI 2–3 (the boundary with the highest clinical ambiguity and consequences) shows the most vulnerability, while ESI 1 and 5 (clearest clinical signals) are most stable.
|
|
191
|
+
- **By clinical category**: Which chief complaint types show the most gender bias? Categories are derived from the actual dataset and include chest pain, dyspnea, cardiac, neuro, GI, psychiatric, trauma, infection, metabolic, extremity pain, weakness/fatigue, and swelling. Chest pain is the primary area of interest given well-documented real-world gender disparities in MI diagnosis and treatment.
|
|
192
|
+
|
|
193
|
+
For each stratum: disagreement rate, accuracy range across variants, per-variant deviation from baseline.
|
|
194
|
+
|
|
195
|
+
#### Analysis 7. Decision Boundary Analysis
|
|
196
|
+
|
|
197
|
+
Evaluates how often adding sex info pushes predictions across each adjacent ESI boundary (1↔2, 2↔3, 3↔4, 4↔5). The ESI 2↔3 boundary is particularly critical: ESI 2 patients require immediate intervention while ESI 3 patients may wait, so a sex-induced crossing here directly affects patient safety. Reports crossing counts, crossing rates relative to near-boundary cases, and direction (toward more urgent vs less urgent).
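One way to count crossings for a single boundary is sketched below. It assumes that "near-boundary" means the baseline prediction sits on either side of the boundary, which the README does not define; the function name and the returned keys are illustrative.

```python
def boundary_crossings(baseline, variant, lower):
    """Crossings of the boundary between ESI `lower` and `lower + 1`."""
    upper = lower + 1
    # Assumed definition: near-boundary cases have a baseline of `lower` or `upper`.
    near = [(b, v) for b, v in zip(baseline, variant) if b in (lower, upper)]
    less_urgent = sum(1 for b, v in near if b == lower and v >= upper)
    more_urgent = sum(1 for b, v in near if b == upper and v <= lower)
    rate = (less_urgent + more_urgent) / len(near) if near else 0.0
    return {
        "near_boundary": len(near),
        "toward_less_urgent": less_urgent,
        "toward_more_urgent": more_urgent,
        "crossing_rate": rate,
    }
```

Calling it with `lower=2` gives the figures for the ESI 2↔3 boundary.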
|
|
198
|
+
|
|
199
|
+
#### Analysis 8. Consistency by Case Difficulty
|
|
200
|
+
|
|
201
|
+
Tests whether sex-sensitivity compounds with clinical uncertainty. Uses baseline (no-sex-info) error as a difficulty proxy: cases the baseline gets right are "easy", cases it misses by 1 ESI level are "moderate", cases it misses by 2+ are "hard". If disagreement rate increases with difficulty, sex noise is most destabilizing exactly when the model is already uncertain (a compounding safety risk).
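The bucketing rule translates directly into code. In the sketch below the difficulty thresholds come from the paragraph above, while the definition of disagreement (any labeled variant differing from baseline) and the input layout are assumptions.

```python
def difficulty(baseline_pred: int, actual: int) -> str:
    """Bucket a case by how far the no-sex baseline misses the ground truth."""
    err = abs(baseline_pred - actual)
    if err == 0:
        return "easy"
    return "moderate" if err == 1 else "hard"


def disagreement_by_difficulty(cases):
    """cases: list of (actual, baseline_pred, [labeled-variant predictions])."""
    totals, disagreements = {}, {}
    for actual, base, labeled in cases:
        key = difficulty(base, actual)
        totals[key] = totals.get(key, 0) + 1
        # Assumed definition: a case disagrees if any labeled variant differs from baseline.
        disagreements[key] = disagreements.get(key, 0) + int(any(p != base for p in labeled))
    return {k: disagreements[k] / totals[k] for k in totals}
```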
|
|
202
|
+
|
|
203
|
+
#### Analysis 9. Case-Level Detail
|
|
204
|
+
|
|
205
|
+
Generates a per-case table with every variant's prediction, the deviation from baseline for each sex label, prediction range and variance across variants, and a chief complaint snippet for manual clinical review. Sorted by prediction range descending so the most affected cases appear first. Also exports a separate file containing only disagreement cases for focused review.
|
|
206
|
+
|
|
207
|
+
#### Analysis 10. Pairwise Comparisons
|
|
208
|
+
|
|
209
|
+
Computes agreement rate, mean signed difference, mean absolute difference, McNemar's test (exact binomial for small samples, χ² with continuity correction for large), Cohen's h effect size, and discordant case counts between every pair of the 4 variants. This produces 6 comparisons per model, with the most important being `nb_ambiguous` ↔ each labeled variant (causal effect of adding sex info) and `female` ↔ `male` (direct gender discrimination).
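Two of the statistics named above have compact closed forms, sketched here with the standard library only. The function names are illustrative; `b` and `c` are the discordant counts (cases correct under one variant but not the other, in each direction), and the large-sample χ² branch is not shown.

```python
from math import asin, comb, sqrt


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact binomial McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases, nothing to test
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
```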
|
|
210
|
+
|
|
211
|
+
#### Analysis 11. Omnibus Statistical Significance
|
|
212
|
+
|
|
213
|
+
Computes omnibus statistical tests across all variants for each model to determine if there is a statistically significant effect across the group:
|
|
214
|
+
- **Cochran's Q test**[4]: Are accuracy rates (binary correctness) significantly different across all four variants? (A generalization of McNemar's test for >2 groups).
|
|
215
|
+
- **Friedman test**[5]: Do predicted ESI scores (ordinal) differ across variants? (A repeated-measures test on the same clinical cases).
|
|
216
|
+
- **FDR Correction**[6]: Benjamini-Hochberg false discovery rate correction applied to the p-values to control for multiple testing.
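For reference, the Benjamini-Hochberg step-up adjustment can be written in a few lines. This is a generic sketch of the procedure, not the code in `analyze_significance.py`, and the function name is illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An effect is then reported as significant when its adjusted p-value falls below the chosen false discovery rate.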
|
|
217
|
+
|
|
218
|
+
**Outputs**: Per-model directory with 8+ CSV files (transition matrices, dangerous transitions, vulnerability profiles, boundary crossings, consistency by difficulty, pairwise comparisons, case detail, model summary). Note that per-model visualizations are disabled by design; all plotting is consolidated into cross-model charts in the parent directory.
|
|
219
|
+
|
|
220
|
+
## Output Structure
|
|
221
|
+
|
|
222
|
+
```
|
|
223
|
+
results/
|
|
224
|
+
├── merged_evaluations.csv # Stage 1: All runs joined
|
|
225
|
+
├── benchmark_stats.csv # Stage 2: Per-run metrics (compact)
|
|
226
|
+
├── benchmark_stats_full.csv # Stage 2: Including confusion matrices
|
|
227
|
+
├── benchmark_report.txt # Stage 2: Formatted text report
|
|
228
|
+
│
|
|
229
|
+
└── attention/ # Stage 3: Deep attention analysis
|
|
230
|
+
├── cross_model_attention_summary.csv # Model ranking table
|
|
231
|
+
├── cross_model_attention_report.txt # Formatted text report summarizing cross-model attention ranking
|
|
232
|
+
├── all_models_pairwise.csv # Combined pairwise across models
|
|
233
|
+
├── all_models_dangerous_transitions.csv# All flagged transitions
|
|
234
|
+
├── pss_ranking_bar_chart.png # Visual ranking of models by sensitivity
|
|
235
|
+
├── esi_2_to_3_undertriage_stacked_bar.png # Visual of undertriage events
|
|
236
|
+
├── four_layer_decomposition_heatmap.png # Four-Layer effect decomposition by model
|
|
237
|
+
├── accuracy_by_condition_grouped_bar.png # Exact match accuracy by sex-label condition
|
|
238
|
+
├── male_anchor_effect_dumbbell.png # Accuracy delta: the male anchor effect
|
|
239
|
+
├── transition_sankey_diagram.png # Aggregated ESI score transitions
|
|
240
|
+
├── clinical_category_vulnerability_heatmap.png # Clinical category vulnerability by model
|
|
241
|
+
├── pairwise_confusion_matrices.png # Pairwise condition confusion matrices
|
|
242
|
+
├── family_condition_interaction_plot.png # Model family × condition interaction
|
|
243
|
+
├── mean_esi_deviation_diverging_bar.png # Mean signed ESI deviation by condition vs. baseline
|
|
244
|
+
├── counterfactual_vignette_panel_case_*.png # Counterfactual vignette examples (15 cases)
|
|
245
|
+
│
|
|
246
|
+
├── openai_gpt-5.4-2026-03-05/ # Per-model directory
|
|
247
|
+
│ ├── attention_report.txt # Formatted text report with detailed attention metrics
|
|
248
|
+
│ ├── model_attention_summary.csv # Flat summary of all metrics
|
|
249
|
+
│ ├── transition_female.csv # 5×5 baseline→female matrix
|
|
250
|
+
│ ├── transition_male.csv # 5×5 baseline→male matrix
|
|
251
|
+
│ ├── transition_nb_label_only.csv # 5×5 baseline→NB matrix
|
|
252
|
+
│ ├── dangerous_transitions.csv # Risk-classified transitions
|
|
253
|
+
│ ├── vulnerability_by_esi.csv # Per-ESI sensitivity profile
|
|
254
|
+
│ ├── vulnerability_by_category.csv # Per-complaint sensitivity
|
|
255
|
+
│ ├── boundary_crossings.csv # Per-boundary crossing rates
|
|
256
|
+
│ ├── consistency_by_difficulty.csv # Sensitivity × difficulty
|
|
257
|
+
│ ├── pairwise_comparisons.csv # All 6 variant pairs
|
|
258
|
+
│ ├── case_detail_all.csv # Every case, all predictions
|
|
259
|
+
│ └── case_detail_disagreements.csv # Only disagreement cases
|
|
260
|
+
│
|
|
261
|
+
├── openai_gpt-5.4-mini-2026-03-17/
|
|
262
|
+
│ └── ...
|
|
263
|
+
└── openai_gpt-5.4-nano-2026-03-17/
|
|
264
|
+
└── ...
|
|
265
|
+
```
|
|
266
|
+
|
|
267
|
+
## Clinical Significance
|
|
268
|
+
|
|
269
|
+
This pipeline provides the quantitative framework to answer:
|
|
270
|
+
|
|
271
|
+
- **Does the model change its triage decision when sex changes but everything else is identical?** (Analysis 1)
|
|
272
|
+
- **Is the model reacting to the presence of sex info, or specifically to male vs female?** (Analysis 2)
|
|
273
|
+
- **Are the sex-induced changes clinically dangerous?** (Analysis 3)
|
|
274
|
+
- **Can sex be reverse-engineered from predictions?** (Analysis 4)
|
|
275
|
+
- **Which model is most sex-invariant?** (Analysis 5)
|
|
276
|
+
- **Is gender bias concentrated in specific complaint types like chest pain?** (Analysis 6)
|
|
277
|
+
- **Does gender noise push predictions across critical acuity boundaries?** (Analysis 7)
|
|
278
|
+
- **Is the model least reliable when it matters most?** (Analysis 8)
|
|
279
|
+
- **Which specific cases are most affected for clinical review?** (Analysis 9)
|
|
280
|
+
- **Are pairwise differences between demographic variants statistically significant?** (Analysis 10)
|
|
281
|
+
- **Is there a statistically significant effect across all variants as a whole?** (Analysis 11)
|
|
282
|
+
|
|
283
|
+
## Installation
|
|
284
|
+
|
|
285
|
+
1. Clone the repository:
|
|
286
|
+
```bash
|
|
287
|
+
git clone https://github.com/HARMONI-Lab/harmoni-yentlbench.git
|
|
288
|
+
cd harmoni-yentlbench
|
|
289
|
+
```
|
|
290
|
+
|
|
291
|
+
2. Create a virtual environment and install the package in editable mode:
|
|
292
|
+
```bash
|
|
293
|
+
python -m venv .venv
|
|
294
|
+
source .venv/bin/activate # On Windows, use `.venv\Scripts\activate`
|
|
295
|
+
pip install -e .
|
|
296
|
+
```
|
|
297
|
+
|
|
298
|
+
## Expected Data Format
|
|
299
|
+
|
|
300
|
+
To use this pipeline with your own models, you must provide evaluation results formatted as `.run.json` files. Place them in a directory (e.g., `results/`).
|
|
301
|
+
|
|
302
|
+
### File Naming Convention
|
|
303
|
+
The parser extracts the demographic **variant** and the **model name** directly from the filename. Your filenames **must** conform to the following regex patterns:
|
|
304
|
+
- Variant extraction: `scorer_(.+?)-run_id`
|
|
305
|
+
- Model extraction: `Run_\d+_(.+?)\.run\.json$`
|
|
306
|
+
|
|
307
|
+
**Examples of valid filenames:**
|
|
308
|
+
- `batch_esi_triage_scorer_female-run_id_Run_1_meta_llama3-70b.run.json`
|
|
309
|
+
- `batch_esi_triage_scorer_nb_ambiguous-run_id_Run_2_openai_gpt-4.run.json`
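The two patterns can be checked against a filename before a run is started. The sketch below applies the regexes exactly as documented; the helper name `parse_run_filename` and the error message are hypothetical.

```python
import re

VARIANT_RE = re.compile(r"scorer_(.+?)-run_id")
MODEL_RE = re.compile(r"Run_\d+_(.+?)\.run\.json$")


def parse_run_filename(name: str):
    """Return (variant, model) extracted from a `.run.json` filename."""
    variant = VARIANT_RE.search(name)
    model = MODEL_RE.search(name)
    if variant is None or model is None:
        raise ValueError(f"filename does not match the expected patterns: {name}")
    return variant.group(1), model.group(1)
```

For the first example filename above this yields `("female", "meta_llama3-70b")`, which corresponds to the composite run label `female__meta_llama3-70b`.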
|
|
310
|
+
|
|
311
|
+
### Supported Demographic Variants
|
|
312
|
+
Ensure your filenames map to the following variants defined in `config.py`:
|
|
313
|
+
- `nb_ambiguous` (The baseline: No sex information provided)
|
|
314
|
+
- `female` (Sex: female)
|
|
315
|
+
- `male` (Sex: male)
|
|
316
|
+
- `nb_label_only` (Sex: non-binary)
|
|
317
|
+
|
|
318
|
+
### JSON Structure
|
|
319
|
+
Each `.run.json` file must contain a `subruns` array, where each item has the following structure:
|
|
320
|
+
```json
|
|
321
|
+
{
|
|
322
|
+
"subruns": [
|
|
323
|
+
{
|
|
324
|
+
"pyRunId": "run-001",
|
|
325
|
+
"conversations": [
|
|
326
|
+
{
|
|
327
|
+
"requests": [
|
|
328
|
+
{
|
|
329
|
+
"contents": [{"parts": [{"text": "Chief complaint: Chest pain..."}]}]
|
|
330
|
+
}
|
|
331
|
+
],
|
|
332
|
+
"metrics": {
|
|
333
|
+
"inputTokens": 150,
|
|
334
|
+
"outputTokens": 5,
|
|
335
|
+
"totalBackendLatencyMs": 1200
|
|
336
|
+
}
|
|
337
|
+
}
|
|
338
|
+
],
|
|
339
|
+
"results": [
|
|
340
|
+
{
|
|
341
|
+
"dictResult": {
|
|
342
|
+
"actual_score": 2.0,
|
|
343
|
+
"predicted_score": 2.0
|
|
344
|
+
}
|
|
345
|
+
}
|
|
346
|
+
]
|
|
347
|
+
}
|
|
348
|
+
]
|
|
349
|
+
}
|
|
350
|
+
```
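A minimal reader for this structure might look like the sketch below. It follows the key names shown in the example and nothing else; the function name, the flat record layout, and the choice to join all prompt parts with newlines are assumptions, and the real parser in `merge_runs.py` may handle more cases.

```python
import json


def iter_subrun_records(run_json: str):
    """Yield one flat record per result found in a `.run.json` document."""
    data = json.loads(run_json)
    for subrun in data.get("subruns", []):
        # Collect every prompt text part nested under conversations/requests.
        texts = [
            part["text"]
            for conv in subrun.get("conversations", [])
            for req in conv.get("requests", [])
            for content in req.get("contents", [])
            for part in content.get("parts", [])
            if "text" in part
        ]
        for result in subrun.get("results", []):
            scores = result.get("dictResult", {})
            yield {
                "run_id": subrun.get("pyRunId"),
                "prompt": "\n".join(texts),
                "actual_score": scores.get("actual_score"),
                "predicted_score": scores.get("predicted_score"),
            }
```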
|
|
351
|
+
|
|
352
|
+
## Usage
|
|
353
|
+
|
|
354
|
+
The pipeline is now accessible via the `yentlbench` CLI. You can use the individual commands or the provided `Makefile` for a streamlined workflow.
|
|
355
|
+
|
|
356
|
+
### Step 0: Prepare Dataset
|
|
357
|
+
|
|
358
|
+
Run the dataset preparation to process the raw MIMIC-IV-ED Demo tables into the expanded quintets used for LLM evaluation.
|
|
359
|
+
|
|
360
|
+
```bash
|
|
361
|
+
yentlbench prepare
|
|
362
|
+
# or using make:
|
|
363
|
+
make prepare
|
|
364
|
+
```
|
|
365
|
+
This generates `dataset_output/dataset_males.csv` and `dataset_output/dataset_quintets.csv`.
|
|
366
|
+
|
|
367
|
+
### Step 1a: Running the Benchmark on Kaggle (Frontier Models)
|
|
368
|
+
|
|
369
|
+
This benchmark is designed to be run on Kaggle for proprietary API-based frontier models. You can find more details and run the benchmark directly on Kaggle: [Yentlbench Kaggle Benchmark](https://www.kaggle.com/benchmarks/innacampo/yentlbench)
|
|
370
|
+
|
|
371
|
+
### Step 1b: Running the Benchmark Locally (Open Models)
|
|
372
|
+
|
|
373
|
+
To run local open-weights models, ensure Ollama is installed and running, then use the CLI:
|
|
374
|
+
|
|
375
|
+
```bash
|
|
376
|
+
yentlbench run --model llama3:8b --variants female male nb_ambiguous nb_label_only
|
|
377
|
+
```
|
|
378
|
+
This generates `.run.json` artifacts in your `results/` directory in the same format that the Kaggle pipeline produces.
|
|
379
|
+
|
|
380
|
+
*New feature*: The local runner now includes robust **row-level resuming**. If your evaluation crashes mid-way, you will not lose your progress. The process will skip already-completed prompts and pick up exactly where it left off.
|
|
381
|
+
|
|
382
|
+
For detailed instructions on mixing runs from Kaggle and local sources into a single benchmark analysis, read [Kaggle vs. Local Runner Workflow](docs/local_vs_kaggle.md).

### Step 2: Merge Runs

**Important:** Steps 2 and 3 do not depend on where your `*.run.json` files came from. Whether you download them from your Kaggle notebook or generate them locally with `yentlbench run`, place them in your `results/` folder and the pipeline processes them identically.

First, merge the individual `.run.json` files into a single CSV.

```bash
yentlbench merge --results-dir results --output eval/merged_evaluations.csv --include-metrics --verbose
# or using make:
make merge
```
**Arguments:**
- `--results-dir`: Directory containing your `*.run.json` files.
- `--output`: Where to save the merged CSV.
- `--include-metrics`: Retains token counts and latency metrics.
- `--verbose`: Enables debug logging.

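
Because run files may come from Kaggle downloads as well as local runs, it can be useful to confirm that each one parses as JSON before merging. The helper below is a hypothetical pre-flight check, not part of the `yentlbench` CLI; it relies only on the `*.run.json` naming described above and assumes nothing about the files' internal schema.

```python
import json
from pathlib import Path


def find_unreadable_runs(results_dir: str) -> list[str]:
    """Return names of *.run.json files that fail to parse as JSON."""
    bad = []
    for path in sorted(Path(results_dir).glob("*.run.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            bad.append(path.name)
    return bad
```

An empty list from `find_unreadable_runs("results")` means every run file is at least well-formed JSON; any names returned point at files worth re-downloading or regenerating.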

### Step 3: Compute Benchmark Statistics

Once the data is merged, you can compute per-run performance metrics:

```bash
yentlbench analyze --input eval/merged_evaluations.csv --output-stats eval/benchmark_stats.csv --output-attention eval/attention --verbose
# or using make:
make analyze
```
*Note: The `analyze` command now performs both the benchmark statistics calculation and the deep attention analysis pipeline.*

### Full Pipeline Execution

If your result files are already in `results/`, you can run the entire workflow (prepare → run → merge → analyze) with a single command:

```bash
make all
```
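
Where `make` is not available, the documented CLI steps can be chained from Python instead. This is only a sketch: the contents of the project's `make all` target are not shown here, so the sequence below simply replays the `prepare`, `merge`, and `analyze` commands as written in the steps above, and it leaves out the `run` step because that step needs a model choice. `PIPELINE` and `run_pipeline` are illustrative names.

```python
import subprocess

# Commands copied from the steps above; adjust paths to your layout.
PIPELINE = [
    ["yentlbench", "prepare"],
    ["yentlbench", "merge", "--results-dir", "results",
     "--output", "eval/merged_evaluations.csv",
     "--include-metrics", "--verbose"],
    ["yentlbench", "analyze", "--input", "eval/merged_evaluations.csv",
     "--output-stats", "eval/benchmark_stats.csv",
     "--output-attention", "eval/attention", "--verbose"],
]


def run_pipeline(commands=PIPELINE, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        # check=True makes subprocess.run raise CalledProcessError
        # when a command exits with a non-zero status.
        runner(cmd, check=True)
```

Passing the commands as argument lists rather than one shell string avoids quoting issues with paths, and the `runner` parameter makes the sequence easy to dry-run.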

## Interpreting the Output

When the pipeline finishes, it prints a **Cross-Model Attention Ranking** to the console.
- Models are ranked by a **Sensitivity Score**.
- A lower score indicates the model is more sex-invariant (robust to demographic perturbations).
- A higher score suggests the model frequently changes its medical triage prediction based solely on the patient's sex.

Check the directory passed to `--output-attention` (e.g., `eval/attention/`) for detailed CSV outputs per model, including dangerous triage transitions and category-specific vulnerability matrices.
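
The exact formula behind the Sensitivity Score is not spelled out in this section. Purely as an illustration of the idea, the sketch below computes a simple flip rate: the share of cases whose predicted triage level differs across the sex variants of the same case. The function name and the input layout (case id mapped to per-variant predictions) are assumptions, not the package's data model or its actual metric.

```python
def flip_rate(predictions: dict[str, dict[str, int]]) -> float:
    """Fraction of cases whose prediction is not identical across variants.

    predictions maps a case id to {variant name: predicted triage level}.
    """
    if not predictions:
        raise ValueError("no cases to score")
    flipped = sum(
        1
        for by_variant in predictions.values()
        if len(set(by_variant.values())) > 1
    )
    return flipped / len(predictions)


# Hypothetical data: case-2 changes level for the male variant only.
example = {
    "case-1": {"female": 3, "male": 3, "nb_ambiguous": 3},
    "case-2": {"female": 3, "male": 2, "nb_ambiguous": 3},
}
print(flip_rate(example))  # 0.5
```

Under this toy definition, a score of 0 means the model never changed its prediction when only the sex variant changed, which matches the "lower is more sex-invariant" reading above.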

## Project Structure
- `src/yentlbench/__main__.py`: Unified CLI entrypoint for the complete workflow.
- `src/yentlbench/dataset_prep.py`: Prepares MIMIC-IV-ED Demo data, handles the gender quintet expansion, and filters out complaints where sex is a legitimate clinical variable (e.g., abdominal pain).
- `src/yentlbench/merge_runs.py`: Parses and joins JSON output files.
- `src/yentlbench/benchmark_stats.py`: Computes per-run benchmark statistics.
- `src/yentlbench/attention_pipeline/pipeline.py`: Main orchestrator for the analysis suite.
- `src/yentlbench/config.py`: Configuration for clinical categories, ESI levels, and variant constants.
- `src/yentlbench/attention_pipeline/analyze_*.py`: Modular scripts containing the statistical tests and calculations for each step of the pipeline.
- `src/yentlbench/attention_pipeline/visuals.py`: Generates cross-model plots and visualization charts.
- `src/yentlbench/attention_pipeline/report.py` / `src/yentlbench/attention_pipeline/save.py`: Handles formatting outputs for the console and saving results to disk.
- `src/yentlbench/attention_pipeline/util.py`: Shared data loading and helper functions.
- `src/yentlbench/local_runner/`: Submodule containing the robust `OllamaRunner` (with row-level resuming logic), prompt builders, and local output parsers.

This work was created as part of the Kaggle competition "Measuring Progress Toward AGI - Cognitive Abilities". [7]

## References & citations
1. Gilboy N, et al. Emergency Severity Index, Version 4. AHRQ. 2011.
2. Healy B. The Yentl Syndrome. NEJM. 1991;325(4):274–276.
3. Johnson A, et al. MIMIC-IV-ED Demo (v2.2). PhysioNet. 2023. doi:10.13026/jzz5-vs76
4. Cochran WG. The comparison of percentages in matched samples. Biometrika. 1950;37:256–266.
5. Friedman M. The use of ranks to avoid the assumption of normality. J Am Stat Assoc. 1937;32:675–701.
6. Benjamini Y, Hochberg Y. Controlling the false discovery rate. J R Stat Soc Series B. 1995;57:289–300.
7. Plomecka B, et al. [Measuring Progress Toward AGI - Cognitive Abilities.](https://kaggle.com/competitions/kaggle-measuring-agi) Kaggle; 2026.