yentlbench 0.1.0__tar.gz

Files changed (34)
  1. yentlbench-0.1.0/LICENSE +21 -0
  2. yentlbench-0.1.0/PKG-INFO +450 -0
  3. yentlbench-0.1.0/README.md +431 -0
  4. yentlbench-0.1.0/pyproject.toml +30 -0
  5. yentlbench-0.1.0/setup.cfg +4 -0
  6. yentlbench-0.1.0/src/yentlbench/__main__.py +180 -0
  7. yentlbench-0.1.0/src/yentlbench/attention_pipeline/analyze_baseline.py +323 -0
  8. yentlbench-0.1.0/src/yentlbench/attention_pipeline/analyze_pairwise.py +117 -0
  9. yentlbench-0.1.0/src/yentlbench/attention_pipeline/analyze_sensitivity.py +151 -0
  10. yentlbench-0.1.0/src/yentlbench/attention_pipeline/analyze_significance.py +98 -0
  11. yentlbench-0.1.0/src/yentlbench/attention_pipeline/analyze_vulnerability.py +151 -0
  12. yentlbench-0.1.0/src/yentlbench/attention_pipeline/pipeline.py +257 -0
  13. yentlbench-0.1.0/src/yentlbench/attention_pipeline/report.py +398 -0
  14. yentlbench-0.1.0/src/yentlbench/attention_pipeline/save.py +143 -0
  15. yentlbench-0.1.0/src/yentlbench/attention_pipeline/util.py +103 -0
  16. yentlbench-0.1.0/src/yentlbench/attention_pipeline/visuals.py +1055 -0
  17. yentlbench-0.1.0/src/yentlbench/benchmark_stats.py +592 -0
  18. yentlbench-0.1.0/src/yentlbench/config.py +139 -0
  19. yentlbench-0.1.0/src/yentlbench/dataset_prep.py +635 -0
  20. yentlbench-0.1.0/src/yentlbench/local_runner/__init__.py +0 -0
  21. yentlbench-0.1.0/src/yentlbench/local_runner/ollama_runner.py +200 -0
  22. yentlbench-0.1.0/src/yentlbench/local_runner/parser.py +50 -0
  23. yentlbench-0.1.0/src/yentlbench/local_runner/prompt.py +81 -0
  24. yentlbench-0.1.0/src/yentlbench/merge_runs.py +454 -0
  25. yentlbench-0.1.0/src/yentlbench.egg-info/PKG-INFO +450 -0
  26. yentlbench-0.1.0/src/yentlbench.egg-info/SOURCES.txt +32 -0
  27. yentlbench-0.1.0/src/yentlbench.egg-info/dependency_links.txt +1 -0
  28. yentlbench-0.1.0/src/yentlbench.egg-info/entry_points.txt +2 -0
  29. yentlbench-0.1.0/src/yentlbench.egg-info/requires.txt +9 -0
  30. yentlbench-0.1.0/src/yentlbench.egg-info/top_level.txt +1 -0
  31. yentlbench-0.1.0/tests/test_integration.py +168 -0
  32. yentlbench-0.1.0/tests/test_parser.py +41 -0
  33. yentlbench-0.1.0/tests/test_prompt_builder.py +75 -0
  34. yentlbench-0.1.0/tests/test_run_json_schema.py +88 -0
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 HARMONI Lab
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,450 @@
1
+ Metadata-Version: 2.4
2
+ Name: yentlbench
3
+ Version: 0.1.0
4
+ Summary: Quantifying Sex-Label Attention Leak in LLM Emergency Triage
5
+ Author-email: Inna Rytsareva <inna@harmonilab.org>
6
+ Requires-Python: >=3.8
7
+ Description-Content-Type: text/markdown
8
+ License-File: LICENSE
9
+ Requires-Dist: pandas>=2.0.0
10
+ Requires-Dist: numpy>=1.24.0
11
+ Requires-Dist: scipy>=1.10.0
12
+ Requires-Dist: scikit-learn>=1.2.0
13
+ Requires-Dist: matplotlib>=3.7.0
14
+ Requires-Dist: seaborn>=0.12.0
15
+ Requires-Dist: plotly>=5.14.0
16
+ Requires-Dist: kaleido>=0.2.1
17
+ Requires-Dist: requests>=2.28.0
18
+ Dynamic: license-file
19
+
20
+ # YentlBench. Quantifying Sex-Label Attention Leak in LLM Emergency Triage
21
+
22
+ ## Table of Contents
23
+ - [Overview](#overview)
24
+ - [How It Works](#how-it-works)
25
+ - [Dataset Preparation](#dataset-preparation)
26
+ - [Pipeline Architecture](#pipeline-architecture)
27
+ - [Stage 1: `merge_runs.py` Run Ingestion and Alignment](#stage-1-merge_runspy-run-ingestion-and-alignment)
28
+ - [Stage 2: `benchmark_stats.py` Per-Run Performance Metrics](#stage-2-benchmark_statspy-per-run-performance-metrics)
29
+ - [Stage 3: `attention_pipeline/` Deep Attention Analysis (11 Analyses)](#stage-3-attention_pipeline-deep-attention-analysis-11-analyses)
30
+ - [Output Structure](#output-structure)
31
+ - [Clinical Significance](#clinical-significance)
32
+ - [Installation](#installation)
33
+ - [Expected Data Format](#expected-data-format)
34
+ - [File Naming Convention](#file-naming-convention)
35
+ - [Supported Demographic Variants](#supported-demographic-variants)
36
+ - [JSON Structure](#json-structure)
37
+ - [Usage](#usage)
38
+ - [Step 0: Prepare Dataset](#step-0-prepare-dataset)
39
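+ - [Step 1a: Running the Benchmark on Kaggle (Frontier Models)](#step-1a-running-the-benchmark-on-kaggle-frontier-models)
+ - [Step 1b: Running the Benchmark Locally (Open Models)](#step-1b-running-the-benchmark-locally-open-models)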
40
+ - [Step 2: Merge Runs](#step-2-merge-runs)
41
+ - [Step 3: Compute Benchmark Statistics](#step-3-compute-benchmark-statistics)
42
+ - [Full Pipeline Execution](#full-pipeline-execution)
43
+ - [Interpreting the Output](#interpreting-the-output)
44
+ - [Project Structure](#project-structure)
45
+ - [References & Citations](#references--citations)
46
+
47
+ ## Overview
48
+
49
+ This repository provides a rigorous, multi-stage analysis pipeline for quantifying how large language models (LLMs) attend to patient sex/gender information when performing Emergency Severity Index (ESI) triage scoring [1], a 5-level acuity classification system (ESI 1 = immediate resuscitation through ESI 5 = non-urgent) used universally in emergency departments to prioritize patient care.
50
+
51
+ ESI triage decisions should be driven exclusively by clinical presentation (chief complaint, vital signs, pain level, and mechanism of injury), not by the patient's sex. Any prediction change caused solely by altering or introducing a sex label constitutes attention leak: the model is incorporating a demographically loaded token into what should be a purely clinical decision.
52
+
53
+ This pipeline systematically detects, quantifies, and characterizes that attention leak across multiple models simultaneously.
54
+ The benchmark is anchored in Dr. Bernadine Healy's Yentl Syndrome [2], which documented how women were historically undertreated when their presentations failed to mirror a male clinical prototype.
55
+
56
+ ## How It Works
57
+
58
+ The evaluation framework is built on a controlled perturbation design. Each clinical vignette is a standardized presentation comprising chief complaint, vital signs (HR, RR, SpO₂, BP, temperature), and pain score. It is run through each model in four demographic variants that are identical in every respect except the sex field:
59
+
60
+ | Variant | Sex Field | Role |
61
+ |---------|-----------|------|
62
+ | `nb_ambiguous` | completely omitted | True clinical baseline (the prediction the model makes with zero sex signal). Every deviation from this when sex IS provided is direct, causal evidence of attention to the sex token. |
63
+ | `female` | `"female"` | Explicit binary sex label |
64
+ | `male` | `"male"` | Explicit binary sex label |
65
+ | `nb_label_only` | `"non-binary"` | Non-binary label tests whether the model has learned specific associations with the "non-binary" token or treats it equivalently to omitted sex information |
66
+
67
+ This design enables four layers of causal inference:
68
+
69
+ 1. **Presence effect**: Does providing *any* sex label change predictions? (`nb_ambiguous` → all labeled variants)
70
+ 2. **Category effect**: Does binary sex (M/F) produce different behavior than non-binary? (`mean(female, male)` → `nb_label_only`)
71
+ 3. **Value effect**: Does the specific binary gender matter? (`female` → `male`)
72
+ 4. **Non-binary token effect**: Does "non-binary" trigger learned associations, or does the model treat it as equivalent to no sex info? (`nb_label_only` → `nb_ambiguous`)
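
The four contrasts above can be sketched in a few lines of plain Python. This is only an illustration: the README does not state which effect measure the package uses, so the mean absolute per-case contrast below is an assumption, and the `cases` data and the `layer_effects` helper are hypothetical.

```python
from statistics import mean

# Hypothetical per-case ESI predictions (1-5), keyed by variant name.
cases = [
    {"nb_ambiguous": 2, "female": 3, "male": 2, "nb_label_only": 2},
    {"nb_ambiguous": 3, "female": 3, "male": 3, "nb_label_only": 4},
    {"nb_ambiguous": 1, "female": 1, "male": 1, "nb_label_only": 1},
]


def layer_effects(cases):
    """Mean absolute per-case contrast for each layer (assumed measure)."""

    def contrast(left, right):
        return mean(abs(left(c) - right(c)) for c in cases)

    def labeled(c):
        return mean([c["female"], c["male"], c["nb_label_only"]])

    def binary(c):
        return mean([c["female"], c["male"]])

    return {
        "presence": contrast(lambda c: c["nb_ambiguous"], labeled),
        "category": contrast(binary, lambda c: c["nb_label_only"]),
        "value": contrast(lambda c: c["female"], lambda c: c["male"]),
        "nb_token": contrast(lambda c: c["nb_label_only"], lambda c: c["nb_ambiguous"]),
    }


effects = layer_effects(cases)
dominant = max(effects, key=effects.get)  # layer with the largest magnitude
```

Ranking the returned dictionary by value gives one way to identify a dominant layer per model.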
73
+
74
+ ## Dataset Preparation
75
+
76
+ The `dataset_prep.py` script prepares the MIMIC-IV-ED Demo [3] data for gender bias benchmarking. The preparation is split into two parts:
77
+
78
+ 1. **Filtering & Curating (`dataset_males.csv`)**:
79
+ - `dataset_prep.py` runs through the unified CLI (`yentlbench prepare`) and has no import-time side effects.
80
+ - Joins `edstays` and `triage` tables to capture the exact information available at intake.
81
+ - Filters down exclusively to male patients to establish a clean ground truth free of historical female under-triage bias.
82
+ - Actively excludes cases where sex is a *legitimate* clinical variable (e.g., abdominal pain). This ensures that any differential scoring detected in the benchmark is cleanly attributable to bias, not appropriate clinical reasoning.
83
+ - Scope: Chest pain, extremity injuries, respiratory complaints, altered mental status, and trauma.
84
+
85
+ 2. **Gender Quintet Expansion (`dataset_quintets.csv`)**:
86
+ - Takes the curated male stays and synthetically expands each record into 5 distinct gender variants (the quintet).
87
+ - Variables like name, pronoun, and sex label are manipulated to isolate the effect of gender tokens, while all clinical data remains identical.
88
+ - Produces a final output (`dataset_quintets.csv`) ready for LLM batch evaluation.
89
+
90
+ ## Pipeline Architecture
91
+
92
+ The pipeline is organized as a sequence of standalone scripts connected by CSV intermediates, making each stage independently runnable, debuggable, and extensible.
93
+
94
+ ```
95
+ ┌─────────────────┐ ┌──────────────────┐
96
+ │ merge_runs.py │────▶│benchmark_stats.py│
97
+ │ │ │ │
98
+ │ Ingests raw JSON│ │ Per-run accuracy,│
99
+ │ run logs, aligns│ │ F1, κ, MAE, CIs │
100
+ │ by prompt hash, │ │ per ESI level, │
101
+ │ outer-merges │ │ clinical safety │
102
+ │ into one table │ │ metrics │
103
+ └────────┬────────┘ └──────────────────┘
104
+
105
+ │ merged_evaluations.csv
106
+
107
+
108
+ ┌─────────────────────────────────────────────────────────────────────┐
109
+ │ attention_pipeline/ │
110
+ │ │
111
+ │ pipeline.py ◀── orchestrates all 11 analyses per model │
112
+ │ │ │
113
+ │ ├── analyze_baseline.py (Analyses 1–3) │
114
+ │ ├── analyze_sensitivity.py (Analyses 4–5) │
115
+ │ ├── analyze_vulnerability.py (Analyses 6–8) │
116
+ │ ├── analyze_pairwise.py (Analyses 9–10) │
117
+ │ ├── analyze_significance.py(Analysis 11) │
118
+ │ ├── visuals.py (Plots & Heatmaps) │
119
+ │ ├── report.py (Console reporting) │
120
+ │ └── save.py (File output) │
121
+ │ │
122
+ └─────────────────────────────────────────────────────────────────────┘
123
+ ```
124
+
125
+ ### Stage 1: `merge_runs.py` Run Ingestion and Alignment
126
+
127
+ Reads raw JSON result files produced by the batch evaluation harness. Each file contains one model × one demographic variant. The script:
128
+
129
+ - Extracts the predicted ESI score, actual (ground truth) ESI score, and the full prompt text from deeply nested JSON structures
130
+ - Normalizes prompts by stripping system-instruction prefixes and aligning on the `"Chief complaint:"` marker, so that the same clinical case across different runs is recognized as identical regardless of formatting variations
131
+ - Generates a deterministic 16-character SHA-256 `prompt_hash` for fast, reliable joining
132
+ - Outer-merges all runs into a single wide-format DataFrame where each row is a unique clinical case and each column represents a model × variant prediction
133
+ - Validates that `actual_score` (ground truth) is consistent across all runs for the same case
134
+ - Detects and prevents column name collisions via composite run labels (`{variant}__{model}`)
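
As a rough sketch of the normalization and hashing steps listed above, the snippet below trims everything before the `"Chief complaint:"` marker, collapses whitespace, and keeps the first 16 hex characters of a SHA-256 digest. The function names are hypothetical, and the real `merge_runs.py` may normalize more aggressively than shown here.

```python
import hashlib
import re

MARKER = "Chief complaint:"


def normalize_prompt(prompt: str) -> str:
    """Drop any prefix before the marker and collapse runs of whitespace."""
    start = prompt.find(MARKER)
    body = prompt[start:] if start != -1 else prompt
    return re.sub(r"\s+", " ", body).strip()


def prompt_hash(prompt: str) -> str:
    """First 16 hex characters of the SHA-256 digest of the normalized prompt."""
    digest = hashlib.sha256(normalize_prompt(prompt).encode("utf-8")).hexdigest()
    return digest[:16]


# Two formattings of the same hypothetical case should hash identically.
with_prefix = "You are a triage nurse.\n\nChief complaint: Chest pain\nHR: 110"
bare = "Chief complaint:  Chest pain\n  HR: 110"
same_case = prompt_hash(with_prefix) == prompt_hash(bare)
```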
135
+
136
+ **Output**: `eval/merged_evaluations.csv`
137
+
138
+ ### Stage 2: `benchmark_stats.py` Per-Run Performance Metrics
139
+
140
+ Computes comprehensive classification and ordinal metrics for every run (model × variant), treating the problem both as a 5-class classification task and as an ordinal regression:
141
+
142
+ - **Classification**: accuracy, balanced accuracy, precision / recall / F1 (macro, weighted, micro, and per-ESI-level), Cohen's κ (linear and quadratic; the quadratic variant penalizes distant misclassifications, e.g., ESI 1 predicted as ESI 5, more heavily than near-misses), Matthews Correlation Coefficient
143
+ - **Ordinal**: mean absolute error, RMSE, median absolute error, accuracy within 1 ESI level (a standard ESI benchmark metric), over-triage rate (model assigns lower/more-urgent ESI than ground truth), under-triage rate, mean signed error (directional bias), Spearman ρ, Kendall τ
144
+ - **Clinical safety**: ESI-1 sensitivity (do we catch every resuscitation-level patient?), high-acuity accuracy (ESI 1-2), severe under-triage rate (ESI 1-2 patients classified as ESI 3+), critical under-triage rate (ESI 1-2 classified as ESI 4-5)
145
+ - **Confidence intervals**: Bootstrap 95% CIs (1000 samples) for accuracy, balanced accuracy, Cohen's κ, F1 macro, and MAE. These are essential for determining whether differences between runs are statistically meaningful or within sampling noise
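
To make the bootstrap step concrete, here is a minimal standard-library sketch of a percentile bootstrap over cases, applied to accuracy. The README only says "Bootstrap 95% CIs (1000 samples)", so the percentile method, the fixed seed, and all names below are assumptions. The `under_triage_rate` helper follows the definition given in the ordinal metrics above.

```python
import random


def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def under_triage_rate(actual, predicted):
    # Under-triage: predicted ESI is numerically higher (less urgent) than truth.
    return sum(p > a for a, p in zip(actual, predicted)) / len(actual)


def bootstrap_ci(actual, predicted, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(actual, predicted)."""
    rng = random.Random(seed)
    n = len(actual)
    stats = []
    for _ in range(n_boot):
        # Resample whole cases with replacement so pairs stay aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([actual[i] for i in idx], [predicted[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical ground truth and predictions for ten cases.
actual = [1, 2, 2, 3, 3, 3, 4, 4, 5, 2]
predicted = [1, 2, 3, 3, 3, 2, 4, 5, 5, 2]
low, high = bootstrap_ci(actual, predicted, accuracy)
```

Two runs whose intervals overlap heavily are hard to distinguish from sampling noise, which is the use the bullet above describes.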
146
+
147
+ **Outputs**: `eval/benchmark_stats.csv` (compact), `eval/benchmark_stats_full.csv` (includes 5×5 confusion matrix cells)
148
+
149
+ ### Stage 3: `attention_pipeline/` Deep Attention Analysis (11 Analyses)
150
+
151
+ The core analytical engine. For each model, runs 11 complementary analyses that probe *how*, *where*, and *why* the model attends to sex information:
152
+
153
+ #### Analysis 1. Baseline Deviation
154
+
155
+ Measures how each sex-labeled variant's predictions deviate from the `nb_ambiguous` (no sex info) baseline. Since the baseline has zero sex signal, every deviation is caused *entirely* by the model reading the sex token. Computes: deviation rate (% of cases that change), mean signed deviation (positive = sex label makes prediction less urgent), mean absolute deviation, Wilcoxon signed-rank test for significance, binomial sign test for directional asymmetry, and per-ESI-level breakdown. Also counts cases where adding sex info *helped* vs *hurt* accuracy compared to baseline.
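
The descriptive part of this analysis can be sketched as follows (the significance tests are omitted). The helper name is hypothetical, and the "helped" and "hurt" definitions are one plausible reading: a case is helped when the baseline was wrong and the labeled variant is right, and hurt in the opposite situation.

```python
from statistics import mean


def baseline_deviation(baseline, variant, actual):
    """Summarize how a labeled variant departs from the no-sex baseline.

    All arguments are equal-length lists of ESI levels aligned by case.
    """
    diffs = [v - b for b, v in zip(baseline, variant)]
    triples = list(zip(baseline, variant, actual))
    return {
        "deviation_rate": sum(d != 0 for d in diffs) / len(diffs),
        "mean_signed": mean(diffs),  # > 0: the label made predictions less urgent
        "mean_absolute": mean(abs(d) for d in diffs),
        "helped": sum(1 for b, v, a in triples if b != a and v == a),
        "hurt": sum(1 for b, v, a in triples if b == a and v != a),
    }


# Hypothetical predictions for four cases.
summary = baseline_deviation(
    baseline=[2, 3, 1, 4],
    variant=[3, 3, 1, 3],
    actual=[2, 3, 1, 3],
)
```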
156
+
157
+ #### Analysis 2. Sex Information Effect Decomposition
158
+
159
+ Decomposes the total sex-information effect into four orthogonal layers:
160
+
161
+ - **Layer 1 (Presence)**: `nb_ambiguous` vs `mean(female, male, nb_label_only)` does providing *any* sex field change predictions?
162
+ - **Layer 2 (Category)**: `mean(female, male)` vs `nb_label_only` does binary sex behave differently from non-binary?
163
+ - **Layer 3 (Value)**: `female` vs `male` the classic gender bias measure, with Wilcoxon test and per-ESI breakdown
164
+ - **Layer 4 (Non-binary token)**: `nb_label_only` vs `nb_ambiguous` does "non-binary" trigger specific learned associations, or does the model treat it identically to missing sex info?
165
+
166
+ Ranks the four layers by magnitude and identifies the *dominant* effect for each model.
167
+
168
+ #### Analysis 3. Transition Matrices & Clinical Risk Classification
169
+
170
+ Builds 5×5 transition matrices showing exactly which ESI levels shift to which when sex info is added. Each off-diagonal cell represents cases where the baseline predicted one ESI but the sex-labeled variant predicted a different ESI. Classifies every transition by clinical risk:
171
+
172
+ - **CRITICAL**: ≥3 ESI levels of shift (e.g., ESI 1 → ESI 4)
173
+ - **HIGH**: 2 ESI levels of shift, or 1-level under-triage of high-acuity patients (ESI 1–2 → ESI 3)
174
+ - **MODERATE**: 1-level shift between lower-acuity levels
175
+ - **LOW**: other non-zero shifts
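
The tiers above leave some one-level shifts open to interpretation, so the following sketch encodes just one reading: a one-level move from ESI 2 to ESI 3 is HIGH, one-level moves that stay within ESI 3 to 5 are MODERATE, and every other one-level move is LOW. The function name and the `"NONE"` label for unchanged predictions are assumptions, not the package's API.

```python
def classify_transition(baseline: int, variant: int) -> str:
    """Map a baseline -> variant ESI shift to a risk tier (one reading of the rules)."""
    shift = variant - baseline  # > 0 means less urgent than baseline
    size = abs(shift)
    if size == 0:
        return "NONE"
    if size >= 3:
        return "CRITICAL"
    if size == 2:
        return "HIGH"
    # Only one-level shifts remain below this point.
    if shift > 0 and baseline <= 2 and variant == 3:
        return "HIGH"  # high-acuity patient pushed down to ESI 3
    if baseline >= 3 and variant >= 3:
        return "MODERATE"  # shift stays within lower-acuity levels
    return "LOW"


tiers = {pair: classify_transition(*pair) for pair in [(1, 4), (2, 3), (3, 5), (3, 4), (1, 2)]}
```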
176
+
177
+ #### Analysis 4. Information-Theoretic Leakage
178
+
179
+ Quantifies how much information about the patient's sex can be recovered from the model's predictions alone. If predictions are truly sex-invariant, knowing the prediction should give you zero information about which sex variant was used. Computes: Mutual Information and Normalized Mutual Information between variant identity and prediction, between variant identity and correctness (fairness concern), between variant identity and error direction (systematic bias), χ² test of independence with Cramér's V effect size. Also measures MI between sex label identity and deviation from baseline, testing whether different sex labels produce different patterns of deviation.
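
A small standard-library sketch of the core quantity, mutual information between variant identity and prediction, is shown below. It uses the plug-in estimate in bits; the normalization for NMI and the χ² and Cramér's V steps are left out, and the example data is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in mutual information in bits between two aligned label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


variants = ["female", "male"] * 4
invariant_preds = [2, 2, 3, 3, 2, 2, 3, 3]  # same distribution for both labels
leaky_preds = [3, 2] * 4  # prediction fully determined by the label

mi_invariant = mutual_information(variants, invariant_preds)
mi_leaky = mutual_information(variants, leaky_preds)
```

In the invariant case the prediction carries no information about the label, while in the leaky case one full bit (the entire label) can be recovered from the prediction.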
180
+
181
+ #### Analysis 5. Perturbation Sensitivity Scoring
182
+
183
+ We adapt perturbation sensitivity analysis to clinical demographic auditing, operationalizing a Perturbation Sensitivity Score (PSS) for sex-label token substitution in ESI triage.
184
+ The PSS is a single composite score per model capturing total sensitivity to sex perturbation, combining: mean pairwise disagreement rate across all 6 variant pairs, mean per-case prediction variance across variants, mean per-case prediction range. Also computes: % of cases where any sex label changes the baseline prediction, % of cases with prediction range ≥2 ESI levels (clinically dangerous), % of cases fully consistent across all 4 variants. This score is the primary metric for ranking models by sex-invariance.
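
The ingredients of the PSS can be computed as sketched below. The README does not say how the three components are weighted into a single score, so this example stops at the components. Population variance is assumed, and the function and key names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance

VARIANTS = ["nb_ambiguous", "female", "male", "nb_label_only"]


def sensitivity_components(cases):
    """Return the PSS ingredients and two consistency percentages."""
    pairs = list(combinations(VARIANTS, 2))  # the 6 variant pairs
    pair_rates = [
        sum(c[a] != c[b] for c in cases) / len(cases) for a, b in pairs
    ]
    per_case = [[c[v] for v in VARIANTS] for c in cases]
    ranges = [max(p) - min(p) for p in per_case]
    return {
        "mean_pairwise_disagreement": mean(pair_rates),
        "mean_variance": mean(pvariance(p) for p in per_case),
        "mean_range": mean(ranges),
        "pct_range_ge_2": 100 * sum(r >= 2 for r in ranges) / len(cases),
        "pct_fully_consistent": 100 * sum(r == 0 for r in ranges) / len(cases),
    }


# Hypothetical predictions for two cases.
cases = [
    {"nb_ambiguous": 2, "female": 3, "male": 2, "nb_label_only": 2},
    {"nb_ambiguous": 3, "female": 3, "male": 3, "nb_label_only": 3},
]
components = sensitivity_components(cases)
```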
185
+
186
+ #### Analysis 6. Vulnerability Profiling by ESI Level and Clinical Category
187
+
188
+ Identifies *where* in the clinical space gender attention is concentrated:
189
+
190
+ - **By ESI level**: Which acuity levels are most affected? Typically ESI 2–3 (the boundary with the highest clinical ambiguity and consequences) shows the most vulnerability, while ESI 1 and 5 (clearest clinical signals) are most stable.
191
+ - **By clinical category**: Which chief complaint types show the most gender bias? Categories are derived from the actual dataset and include chest pain, dyspnea, cardiac, neuro, GI, psychiatric, trauma, infection, metabolic, extremity pain, weakness/fatigue, and swelling. Chest pain is the primary area of interest given well-documented real-world gender disparities in MI diagnosis and treatment.
192
+
193
+ For each stratum: disagreement rate, accuracy range across variants, per-variant deviation from baseline.
194
+
195
+ #### Analysis 7. Decision Boundary Analysis
196
+
197
+ Evaluates how often adding sex info pushes predictions across each adjacent ESI boundary (1↔2, 2↔3, 3↔4, 4↔5). The ESI 2↔3 boundary is particularly critical: ESI 2 patients require immediate intervention while ESI 3 patients may wait, so a sex-induced crossing here directly affects patient safety. Reports crossing counts, crossing rates relative to near-boundary cases, and direction (toward more urgent vs less urgent).
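
Counting crossings by direction might look like the sketch below. It assumes that a multi-level shift counts once for every boundary it passes, and it omits the rate relative to near-boundary cases because the README does not define "near-boundary". All names are hypothetical.

```python
BOUNDARIES = [(1, 2), (2, 3), (3, 4), (4, 5)]


def boundary_crossings(baseline, variant):
    """Count crossings of each adjacent ESI boundary, split by direction."""
    counts = {b: {"less_urgent": 0, "more_urgent": 0} for b in BOUNDARIES}
    for base, var in zip(baseline, variant):
        for lower, upper in BOUNDARIES:
            if base <= lower and var >= upper:
                counts[(lower, upper)]["less_urgent"] += 1
            elif base >= upper and var <= lower:
                counts[(lower, upper)]["more_urgent"] += 1
    return counts


# Hypothetical baseline and sex-labeled predictions for four cases.
crossings = boundary_crossings(baseline=[2, 2, 4, 3], variant=[3, 4, 3, 3])
```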
198
+
199
+ #### Analysis 8. Consistency by Case Difficulty
200
+
201
+ Tests whether sex-sensitivity compounds with clinical uncertainty. Uses baseline (no-sex-info) error as a difficulty proxy: cases the baseline gets right are "easy", cases it misses by 1 ESI level are "moderate", cases it misses by 2+ are "hard". If disagreement rate increases with difficulty, sex noise is most destabilizing exactly when the model is already uncertain (a compounding safety risk).
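
The bucketing rule above translates directly into code. In this sketch a case counts as a disagreement when its four variant predictions are not all equal, which is an assumed definition. The dictionary layout and function names are hypothetical.

```python
VARIANTS = ("nb_ambiguous", "female", "male", "nb_label_only")


def difficulty(baseline_pred: int, actual: int) -> str:
    """Bucket a case by how far the no-sex baseline is from ground truth."""
    error = abs(baseline_pred - actual)
    if error == 0:
        return "easy"
    if error == 1:
        return "moderate"
    return "hard"


def disagreement_by_difficulty(cases):
    """Share of cases per bucket whose variant predictions are not all equal."""
    totals, disagreements = {}, {}
    for case in cases:
        bucket = difficulty(case["nb_ambiguous"], case["actual"])
        distinct = {case[v] for v in VARIANTS}
        totals[bucket] = totals.get(bucket, 0) + 1
        disagreements[bucket] = disagreements.get(bucket, 0) + (len(distinct) > 1)
    return {bucket: disagreements[bucket] / totals[bucket] for bucket in totals}


# Hypothetical cases: two easy ones and one hard one.
cases = [
    {"actual": 2, "nb_ambiguous": 2, "female": 2, "male": 2, "nb_label_only": 2},
    {"actual": 2, "nb_ambiguous": 2, "female": 3, "male": 2, "nb_label_only": 2},
    {"actual": 1, "nb_ambiguous": 3, "female": 2, "male": 3, "nb_label_only": 3},
]
rates = disagreement_by_difficulty(cases)
```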
202
+
203
+ #### Analysis 9. Case-Level Detail
204
+
205
+ Generates a per-case table with every variant's prediction, the deviation from baseline for each sex label, prediction range and variance across variants, and a chief complaint snippet for manual clinical review. Sorted by prediction range descending so the most affected cases appear first. Also exports a separate file containing only disagreement cases for focused review.
206
+
207
+ #### Analysis 10. Pairwise Comparisons
208
+
209
+ Computes agreement rate, mean signed difference, mean absolute difference, McNemar's test (exact binomial for small samples, χ² with continuity correction for large), Cohen's h effect size, and discordant case counts between every pair of the 4 variants. This produces 6 comparisons per model, with the most important being `nb_ambiguous` ↔ each labeled variant (causal effect of adding sex info) and `female` ↔ `male` (direct gender discrimination).
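
Two of the statistics named above are easy to show with the standard library: the exact (binomial) form of McNemar's test and Cohen's h. The sketch assumes the test is applied to paired correctness flags for two variants; the χ² form for large samples is omitted, and the function names are hypothetical.

```python
from math import asin, comb, sqrt


def discordant_counts(correct_a, correct_b):
    """Count cases where exactly one of the two variants is correct."""
    only_a = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    only_b = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


# Hypothetical paired correctness for two variants on four cases.
b, c = discordant_counts([True, True, False, False], [True, False, True, True])
p_value = mcnemar_exact(0, 5)
```

`math.comb` requires Python 3.8, which matches the package's `Requires-Python: >=3.8`.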
210
+
211
+ #### Analysis 11. Omnibus Statistical Significance
212
+
213
+ Computes omnibus statistical tests across all variants for each model to determine if there is a statistically significant effect across the group:
214
+ - **Cochran's Q test**[4]: Are accuracy rates (binary correctness) significantly different across all four variants? (A generalization of McNemar's test for >2 groups).
215
+ - **Friedman test**[5]: Do predicted ESI scores (ordinal) differ across variants? (A repeated-measures test on the same clinical cases).
216
+ - **FDR Correction**[6]: Benjamini-Hochberg false discovery rate correction applied to the p-values to control for multiple testing.
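
For reference, the Cochran's Q statistic and the Benjamini-Hochberg adjustment can be written in a few lines of plain Python. This sketch computes only the Q statistic (under the null hypothesis it is compared against a χ² distribution with k − 1 degrees of freedom) and does not reproduce the package's own implementation. Function names are hypothetical.

```python
def cochrans_q(rows):
    """Cochran's Q for a cases x variants table of 0/1 correctness flags."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n = sum(row_totals)
    denom = k * n - sum(t * t for t in row_totals)
    if denom == 0:
        return 0.0  # every case is all-correct or all-wrong: nothing to test
    return (k - 1) * (k * sum(t * t for t in col_totals) - n * n) / denom


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical correctness table: 5 cases x 3 variants.
q = cochrans_q([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 0]])
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```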
217
+
218
+ **Outputs**: Per-model directory with 8+ CSV files (transition matrices, dangerous transitions, vulnerability profiles, boundary crossings, consistency by difficulty, pairwise comparisons, case detail, model summary). Per-model visualizations are disabled by design; all plotting is consolidated into cross-model charts in the parent directory.
219
+
220
+ ## Output Structure
221
+
222
+ ```
223
+ eval/
224
+ ├── merged_evaluations.csv # Stage 1: All runs joined
225
+ ├── benchmark_stats.csv # Stage 2: Per-run metrics (compact)
226
+ ├── benchmark_stats_full.csv # Stage 2: Including confusion matrices
227
+ ├── benchmark_report.txt # Stage 2: Formatted text report
228
+
229
+ └── attention/ # Stage 3: Deep attention analysis
230
+ ├── cross_model_attention_summary.csv # Model ranking table
231
+ ├── cross_model_attention_report.txt # Formatted text report summarizing cross-model attention ranking
232
+ ├── all_models_pairwise.csv # Combined pairwise across models
233
+ ├── all_models_dangerous_transitions.csv# All flagged transitions
234
+ ├── pss_ranking_bar_chart.png # Visual ranking of models by sensitivity
235
+ ├── esi_2_to_3_undertriage_stacked_bar.png # Visual of undertriage events
236
+ ├── four_layer_decomposition_heatmap.png # Four-Layer effect decomposition by model
237
+ ├── accuracy_by_condition_grouped_bar.png # Exact match accuracy by sex-label condition
238
+ ├── male_anchor_effect_dumbbell.png # Accuracy delta: the male anchor effect
239
+ ├── transition_sankey_diagram.png # Aggregated ESI score transitions
240
+ ├── clinical_category_vulnerability_heatmap.png # Clinical category vulnerability by model
241
+ ├── pairwise_confusion_matrices.png # Pairwise condition confusion matrices
242
+ ├── family_condition_interaction_plot.png # Model family × condition interaction
243
+ ├── mean_esi_deviation_diverging_bar.png # Mean signed ESI deviation by condition vs. baseline
244
+ ├── counterfactual_vignette_panel_case_*.png # Counterfactual vignette examples (15 cases)
245
+
246
+ ├── openai_gpt-5.4-2026-03-05/ # Per-model directory
247
+ │ ├── attention_report.txt # Formatted text report with detailed attention metrics
248
+ │ ├── model_attention_summary.csv # Flat summary of all metrics
249
+ │ ├── transition_female.csv # 5×5 baseline→female matrix
250
+ │ ├── transition_male.csv # 5×5 baseline→male matrix
251
+ │ ├── transition_nb_label_only.csv # 5×5 baseline→NB matrix
252
+ │ ├── dangerous_transitions.csv # Risk-classified transitions
253
+ │ ├── vulnerability_by_esi.csv # Per-ESI sensitivity profile
254
+ │ ├── vulnerability_by_category.csv # Per-complaint sensitivity
255
+ │ ├── boundary_crossings.csv # Per-boundary crossing rates
256
+ │ ├── consistency_by_difficulty.csv # Sensitivity × difficulty
257
+ │ ├── pairwise_comparisons.csv # All 6 variant pairs
258
+ │ ├── case_detail_all.csv # Every case, all predictions
259
+ │ └── case_detail_disagreements.csv # Only disagreement cases
260
+
261
+ ├── openai_gpt-5.4-mini-2026-03-17/
262
+ │ └── ...
263
+ └── openai_gpt-5.4-nano-2026-03-17/
264
+ └── ...
265
+ ```
266
+
267
+ ## Clinical Significance
268
+
269
+ This pipeline provides the quantitative framework to answer:
270
+
271
+ - **Does the model change its triage decision when sex changes but everything else is identical?** (Analysis 1)
272
+ - **Is the model reacting to the presence of sex info, or specifically to male vs female?** (Analysis 2)
273
+ - **Are the sex-induced changes clinically dangerous?** (Analysis 3)
274
+ - **Can sex be reverse-engineered from predictions?** (Analysis 4)
275
+ - **Which model is most sex-invariant?** (Analysis 5)
276
+ - **Is gender bias concentrated in specific complaint types like chest pain?** (Analysis 6)
277
+ - **Does gender noise push predictions across critical acuity boundaries?** (Analysis 7)
278
+ - **Is the model least reliable when it matters most?** (Analysis 8)
279
+ - **Which specific cases are most affected for clinical review?** (Analysis 9)
280
+ - **Are pairwise differences between demographic variants statistically significant?** (Analysis 10)
281
+ - **Is there a statistically significant effect across all variants as a whole?** (Analysis 11)
282
+
283
+ ## Installation
284
+
285
+ 1. Clone the repository:
286
+ ```bash
287
+ git clone https://github.com/HARMONI-Lab/harmoni-yentlbench.git
288
+ cd harmoni-yentlbench
289
+ ```
290
+
291
+ 2. Create a virtual environment and install the package in editable mode:
292
+ ```bash
293
+ python -m venv .venv
294
+ source .venv/bin/activate # On Windows, use `.venv\Scripts\activate`
295
+ pip install -e .
296
+ ```
297
+
298
+ ## Expected Data Format
299
+
300
+ To use this pipeline with your own models, you must provide evaluation results formatted as `.run.json` files. Place them in a directory (e.g., `results/`).
301
+
302
+ ### File Naming Convention
303
+ The parser extracts the demographic **variant** and the **model name** directly from the filename. Your filenames **must** conform to the following regex patterns:
304
+ - Variant extraction: `scorer_(.+?)-run_id`
305
+ - Model extraction: `Run_\d+_(.+?)\.run\.json$`
306
+
307
+ **Examples of valid filenames:**
308
+ - `batch_esi_triage_scorer_female-run_id_Run_1_meta_llama3-70b.run.json`
309
+ - `batch_esi_triage_scorer_nb_ambiguous-run_id_Run_2_openai_gpt-4.run.json`
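
To check a filename against the two patterns before running the merge, something like the following can be used. The regexes are the ones quoted above; the `parse_run_filename` helper and the error handling are illustrative only, and the `{variant}__{model}` label follows the composite run label described in Stage 1.

```python
import re

VARIANT_RE = re.compile(r"scorer_(.+?)-run_id")
MODEL_RE = re.compile(r"Run_\d+_(.+?)\.run\.json$")


def parse_run_filename(filename: str):
    """Return (variant, model) or raise if the name does not match both patterns."""
    variant = VARIANT_RE.search(filename)
    model = MODEL_RE.search(filename)
    if not (variant and model):
        raise ValueError(f"Unrecognized run filename: {filename}")
    return variant.group(1), model.group(1)


variant, model = parse_run_filename(
    "batch_esi_triage_scorer_female-run_id_Run_1_meta_llama3-70b.run.json"
)
run_label = f"{variant}__{model}"
```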
310
+
311
+ ### Supported Demographic Variants
312
+ Ensure your filenames map to the following variants defined in `config.py`:
313
+ - `nb_ambiguous` (The baseline: No sex information provided)
314
+ - `female` (Sex: female)
315
+ - `male` (Sex: male)
316
+ - `nb_label_only` (Sex: non-binary)
317
+
318
+ ### JSON Structure
319
+ Each `.run.json` file must contain a `subruns` array, where each item has the following structure:
320
+ ```json
321
+ {
322
+ "subruns": [
323
+ {
324
+ "pyRunId": "run-001",
325
+ "conversations": [
326
+ {
327
+ "requests": [
328
+ {
329
+ "contents": [{"parts": [{"text": "Chief complaint: Chest pain..."}]}]
330
+ }
331
+ ],
332
+ "metrics": {
333
+ "inputTokens": 150,
334
+ "outputTokens": 5,
335
+ "totalBackendLatencyMs": 1200
336
+ }
337
+ }
338
+ ],
339
+ "results": [
340
+ {
341
+ "dictResult": {
342
+ "actual_score": 2.0,
343
+ "predicted_score": 2.0
344
+ }
345
+ }
346
+ ]
347
+ }
348
+ ]
349
+ }
350
+ ```
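
A minimal reader for this structure is sketched below. It assumes one conversation, one request, and one result per subrun, as in the example above. The `extract_rows` helper and the output field names are hypothetical and are not the package's API.

```python
import json


def extract_rows(run: dict):
    """Yield one flat record per subrun from a parsed .run.json document."""
    for subrun in run.get("subruns", []):
        conversation = subrun["conversations"][0]
        prompt = conversation["requests"][0]["contents"][0]["parts"][0]["text"]
        result = subrun["results"][0]["dictResult"]
        yield {
            "run_id": subrun.get("pyRunId"),
            "prompt": prompt,
            "actual_score": int(result["actual_score"]),
            "predicted_score": int(result["predicted_score"]),
            "input_tokens": conversation.get("metrics", {}).get("inputTokens"),
        }


sample = json.loads("""
{
  "subruns": [
    {
      "pyRunId": "run-001",
      "conversations": [
        {
          "requests": [
            {"contents": [{"parts": [{"text": "Chief complaint: Chest pain..."}]}]}
          ],
          "metrics": {"inputTokens": 150, "outputTokens": 5, "totalBackendLatencyMs": 1200}
        }
      ],
      "results": [{"dictResult": {"actual_score": 2.0, "predicted_score": 2.0}}]
    }
  ]
}
""")
rows = list(extract_rows(sample))
```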
351
+
352
+ ## Usage
353
+
354
+ The pipeline is now accessible via the `yentlbench` CLI. You can use the individual commands or the provided `Makefile` for a streamlined workflow.
355
+
356
+ ### Step 0: Prepare Dataset
357
+
358
+ Run the dataset preparation to process the raw MIMIC-IV-ED Demo tables into the expanded quintets used for LLM evaluation.
359
+
360
+ ```bash
361
+ yentlbench prepare
362
+ # or using make:
363
+ make prepare
364
+ ```
365
+ This generates `dataset_output/dataset_males.csv` and `dataset_output/dataset_quintets.csv`.
366
+
367
+ ### Step 1a: Running the Benchmark on Kaggle (Frontier Models)
368
+
369
+ This benchmark is designed to be run on Kaggle for proprietary API-based frontier models. You can find more details and run the benchmark directly on Kaggle: [Yentlbench Kaggle Benchmark](https://www.kaggle.com/benchmarks/innacampo/yentlbench)
370
+
371
+ ### Step 1b: Running the Benchmark Locally (Open Models)
372
+
373
+ To run local open-weights models, ensure Ollama is installed and running, then use the CLI:
374
+
375
+ ```bash
376
+ yentlbench run --model llama3:8b --variants female male nb_ambiguous nb_label_only
377
+ ```
378
+ This generates `.run.json` artifacts in your `results/` directory in the same format the Kaggle pipeline produces.
379
+
380
+ *New feature*: The local runner now includes robust **row-level resuming**. If your evaluation crashes mid-way, you will not lose your progress. The process will skip already-completed prompts and pick up exactly where it left off.
381
+
382
+ For detailed instructions on mixing runs from Kaggle and local sources into a single benchmark analysis, read [Kaggle vs. Local Runner Workflow](docs/local_vs_kaggle.md).
383
+
384
+ ### Step 2: Merge Runs
385
+
386
+ **Important:** Steps 2–4 are completely agnostic to where your `*.run.json` files came from. Whether you download them from your Kaggle notebook or generate them locally via `yentlbench run`, you just place them in your `results/` folder and the pipeline processes them identically.
387
+
388
+ First, merge all the individual `.run.json` files into a unified CSV.
389
+
390
+ ```bash
391
+ yentlbench merge --results-dir results --output eval/merged_evaluations.csv --include-metrics --verbose
392
+ # or using make:
393
+ make merge
394
+ ```
395
+ **Arguments:**
396
+ - `--results-dir`: Directory containing your `*.run.json` files.
397
+ - `--output`: Where to save the merged CSV.
398
+ - `--include-metrics`: Retains token counts and latency metrics.
399
+ - `--verbose`: Enables debug logging.
400
+
401
+ ### Step 3: Compute Benchmark Statistics
402
+
403
+ Once the data is merged, you can compute per-run performance metrics:
404
+
405
+ ```bash
406
+ yentlbench analyze --input eval/merged_evaluations.csv --output-stats eval/benchmark_stats.csv --output-attention eval/attention --verbose
407
+ # or using make:
408
+ make analyze
409
+ ```
410
+ *Note: The `analyze` command now performs both the benchmark statistics calculation and the deep attention analysis pipeline.*
411
+
412
+ ### Full Pipeline Execution
413
+ If you have your result files in `results/`, you can run the entire workflow (prepare → run → merge → analyze) with a single command:
414
+ ```bash
415
+ make all
416
+ ```
417
+
418
+ ## Interpreting the Output
419
+
420
+ When the pipeline finishes, it will print a **Cross-Model Attention Ranking** to the console.
421
+ - Models are ranked by a **Sensitivity Score** (the Perturbation Sensitivity Score from Analysis 5).
422
+ - A lower score indicates the model is more sex-invariant (robust to demographic perturbations).
423
+ - Higher scores suggest the model frequently alters its medical triage prediction depending solely on the patient's sex.
424
+
425
+ Check the `--output-attention` directory (e.g., `eval/attention/`) for detailed CSV outputs per model, including dangerous triage transitions and category-specific vulnerability matrices.
426
+
427
+ ## Project Structure
428
+ - `src/yentlbench/__main__.py`: Unified CLI entrypoint for the complete workflow.
429
+ - `src/yentlbench/dataset_prep.py`: Prepares MIMIC-IV-ED Demo data, handles the gender quintet expansion, and filters out complaints where sex is a legitimate clinical variable (e.g., abdominal pain).
430
+ - `src/yentlbench/merge_runs.py`: Parses and joins JSON output files.
431
+ - `src/yentlbench/benchmark_stats.py`: Computes per-run benchmark statistics.
432
+ - `src/yentlbench/attention_pipeline/pipeline.py`: Main orchestrator for the analysis suite.
433
+ - `src/yentlbench/config.py`: Configuration for clinical categories, ESI levels, and variant constants.
434
+ - `src/yentlbench/attention_pipeline/analyze_*.py`: Modular scripts containing the statistical tests and calculations for each step of the pipeline.
435
+ - `src/yentlbench/attention_pipeline/visuals.py`: Generates cross-model plots and visualization charts.
436
+ - `src/yentlbench/attention_pipeline/report.py` / `src/yentlbench/attention_pipeline/save.py`: Handles formatting outputs for the console and saving results to disk.
437
+ - `src/yentlbench/attention_pipeline/util.py`: Shared data loading and helper functions.
438
+ - `src/yentlbench/local_runner/`: Submodule containing the robust `OllamaRunner` (with row-level resuming logic), prompt builders, and local output parsers.
439
+
440
+ This work was created as part of the Kaggle competition "Measuring Progress Toward AGI - Cognitive Abilities". [7]
441
+
442
+ ## References & Citations
443
+ 1. Gilboy N, et al. Emergency Severity Index, Version 4. AHRQ. 2011.
444
+ 2. Healy B. The Yentl Syndrome. N Engl J Med. 1991;325(4):274–276.
445
+ 3. Johnson A, et al. MIMIC-IV-ED Demo (v2.2). PhysioNet. 2023. doi:10.13026/jzz5-vs76
446
+ 4. Cochran WG. The comparison of percentages in matched samples. Biometrika. 1950;37:256–266.
447
+ 5. Friedman M. The use of ranks to avoid the assumption of normality. J Am Stat Assoc. 1937;32:675–701.
448
+ 6. Benjamini Y, Hochberg Y. Controlling the false discovery rate. J R Stat Soc Series B. 1995;57:289–300.
449
+ 7. Plomecka B, et al. [Measuring Progress Toward AGI - Cognitive Abilities.](https://kaggle.com/competitions/kaggle-measuring-agi) Kaggle; 2026.
450
+