llmvalidate-0.3.0.tar.gz

MIT License

Copyright (c) 2026 Oncoshot

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Metadata-Version: 2.4
Name: llmvalidate
Version: 0.3.0
Summary: Oncoshot LLM validation framework
License: MIT
Project-URL: Homepage, https://github.com/Oncoshot/llm-validation-framework
Project-URL: Repository, https://github.com/Oncoshot/llm-validation-framework
Project-URL: Bug Tracker, https://github.com/Oncoshot/llm-validation-framework/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# LLM Validation Framework

A comprehensive Python framework for evaluating LLM-extracted structured data against ground truth labels. Supports binary classification, scalar values, and list fields with detailed performance metrics, confidence-based evaluation, and statistical uncertainty quantification via non-parametric bootstrap confidence intervals.

## ✨ Key Features

- **Multi-field validation** - Binary (True/False), scalar (single values), and list (multiple values) data types
- **Partial labeling support** - Handle datasets where different cases have labels for different subsets of fields
- **Dual usage modes** - Validate pre-computed results OR run live LLM inference with validation
- **Comprehensive metrics** - Precision, recall, F1/F2, accuracy, specificity with both micro and macro aggregation
- **Confidence analysis** - Automatic performance breakdown by confidence levels
- **Statistical uncertainty** - Non-parametric bootstrap confidence intervals for all performance metrics
- **Production ready** - Parallel processing, intelligent caching, detailed progress tracking

## 🚀 Quick Start

### Prerequisites
```bash
# Install from PyPI
pip install llmvalidate

# OR install from source
pip install -r requirements.txt  # Python 3.11+ required
```

### Demo
```bash
python runme.py
```

Processes the included [samples.csv](samples.csv) (14 test cases covering all validation scenarios) and outputs timestamped results to `validation_results/samples/`:

- **[Results CSV](validation_results/samples/2026-02-23%2012-42-40%20results.csv)** - Row-by-row comparison with confusion matrix counts and item-level details
- **[Metrics CSV](validation_results/samples/2026-02-23%2012-42-40%20metrics.csv)** - Aggregated performance statistics with confidence breakdowns
- **[CI Metrics CSV](validation_results/samples/2026-02-23%2012-42-40%20CI%20metrics.csv)** - Confidence intervals for metrics

| Rows | Field Type | Test Scenarios |
|------|------------|----------------|
| **1-4** | Binary (`Has metastasis`) | True Positive, True Negative, False Positive, False Negative |
| **5-9** | Scalar (`Diagnosis`, `Histology`) | Correct, incorrect, missing, spurious, and empty extractions |
| **10-14** | List (`Treatment Drugs`, `Test Results`) | Perfect match, spurious items, missing items, correct empty, mixed results |

## 📊 Usage Modes

### Mode 1: Validate Existing Results
When you have LLM predictions in `Res: {Field Name}` columns:

```python
import pandas as pd
from src.validation import validate

df = pd.read_csv("data.csv", index_col="Patient ID")
# df must contain: "Field Name" and "Res: Field Name" columns

results_df, metrics_df = validate(
    source_df=df,
    fields=["Diagnosis", "Treatment"],  # or None for auto-detection
    structure_callback=None,
    output_folder="validation_results"
)
```

### Mode 2: Live LLM Inference + Validation

```python
from src.structured import StructuredResult, StructuredGroup, StructuredField
from src.utils import flatten_structured_result
from src.validation import validate

def llm_callback(row, i, raw_text_column_name):
    raw_text = row[raw_text_column_name]
    # Your LLM inference logic here
    result = StructuredResult(
        groups=[StructuredGroup(
            group_name="medical",
            fields=[
                StructuredField(name="Diagnosis", value="Cancer", confidence="High"),
                StructuredField(name="Treatment", value=["Drug A"], confidence="Medium")
            ]
        )]
    )
    return flatten_structured_result(result), {}

results_df, metrics_df = validate(
    source_df=df,
    fields=["Diagnosis", "Treatment"],
    structure_callback=llm_callback,
    raw_text_column_name="medical_report",
    output_folder="validation_results",
    max_workers=4
)
```

## 📋 Input Data Requirements

### DataFrame Format
- **Unique index** - Each row must have a unique identifier (e.g., "Patient ID")
- **Label columns** - Ground truth values for each field you want to validate
- **Result columns** (Mode 1 only) - LLM predictions as `Res: {Field Name}` columns
- **Raw text column** (Mode 2 only) - Source text for LLM inference (e.g., "medical_report")

### Supported Field Types

| Type | Description | Label Examples | Result Examples |
|------|-------------|----------------|-----------------|
| **Binary** | True/False detection | `True`, `False` | `True`, `False` |
| **Scalar** | Single text/numeric value | `"Lung Cancer"` <br> `42` | `"Breast Cancer"` <br> `38` |
| **List** | Multiple values | `["Drug A", "Drug B"]` <br> `"['Item1', 'Item2']"` | `["Drug A"]` <br> `[]` |

### Special Value Handling
- **`"-"`** = Labeled as "No information is available in the source document"
- **`null/empty/NaN`** = Field not labeled/evaluated (supports partial labeling where different cases may have labels for different field subsets)
- **Lists** - Can be Python lists `["a", "b"]` or stringified `"['a', 'b']"` (auto-converted)

### Partial Labeling Support
The framework supports partial labeling scenarios where:
- Not every case needs labels for every field
- Different cases can have labels for different subsets of fields
- Missing labels (`null`/`NaN`) are handled gracefully in all metrics calculations
- Use `"-"` when the document explicitly lacks information about a field
- Use `null`/`NaN` when the field simply wasn't labeled for that case

## 📈 Output Files

The framework generates two timestamped CSV files for each validation run (the demo's third file, CI metrics, comes from the separate `bootstrap_CI()` step):

### 1. Results CSV (`YYYY-MM-DD HH-MM-SS results.csv`)
**Row-level analysis** with detailed per-case metrics:

**Original Data:**
- All input columns (labels, raw text, etc.)
- `Res: {Field}` columns with LLM predictions
- `Res: {Field} confidence` and `Res: {Field} justification` (if available)

**Binary Fields:**
- `TP/FP/FN/TN: {Field}` - Confusion matrix counts (1 or 0 per row)

**Non-Binary Fields:**
- `Cor/Inc/Mis/Spu: {Field}` - Item counts per row
- `Cor/Inc/Mis/Spu: {Field} items` - Actual item lists
- `Precision/Recall/F1/F2: {Field}` - Per-row metrics (list fields only)

**System Columns:**
- `Sys: from cache` - Whether the result was cached (speeds up duplicate text)
- `Sys: exception` - Error information if processing failed
- `Sys: time taken` - Processing time per row in seconds

### 2. Metrics CSV (`YYYY-MM-DD HH-MM-SS metrics.csv`)
**Aggregated statistics** with confidence breakdowns:

**Core Information:**
- `field` - Field name being evaluated
- `confidence` - Confidence level ("Overall", "High", "Medium", "Low", etc.)
- `labeled cases` - Total rows with ground truth labels
- `field-present cases` - Rows where the document has information about the field (label is not `'-'`)

**Binary Metrics:** `TP`, `TN`, `FP`, `FN`, `precision`, `recall`, `F1/F2`, `accuracy`, `specificity`

**Non-Binary Metrics:** `cor`, `inc`, `mis`, `spu`, `precision/recall/F1/F2 (micro)`, `precision/recall/F1/F2 (macro)`

## ⚡ Performance Metrics Explained

### Binary Classification Metrics

For fields with True/False values (e.g., "Has metastasis"):

#### Confusion Matrix Counts
| Count | Definition | Example |
|-------|------------|---------|
| **TP (True Positive)** | Correctly predicted positive | Label: `True`, Prediction: `True` → TP=1 |
| **TN (True Negative)** | Correctly predicted negative | Label: `False`, Prediction: `False` → TN=1 |
| **FP (False Positive)** | Incorrectly predicted positive | Label: `False`, Prediction: `True` → FP=1 |
| **FN (False Negative)** | Incorrectly predicted negative | Label: `True`, Prediction: `False` → FN=1 |

#### Binary Classification Formulas
| Metric | Formula | Meaning |
|--------|---------|---------|
| **Precision** | `TP / (TP + FP)` | Of all positive predictions, how many were correct? |
| **Recall** | `TP / (TP + FN)` | Of all actual positives, how many were found? |
| **Accuracy** | `(TP + TN) / (TP + TN + FP + FN)` | Overall percentage of correct predictions |
| **Specificity** | `TN / (TN + FP)` | Of all actual negatives, how many were correctly identified? |
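
As a quick sanity check, the binary formulas above can be computed directly from the four counts (a standalone sketch, independent of the framework):

```python
def binary_metrics(TP, TN, FP, FN):
    """Compute binary classification metrics from confusion-matrix counts."""
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    specificity = TN / (TN + FP) if TN + FP else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "specificity": specificity}

print(binary_metrics(TP=8, TN=5, FP=2, FN=1))
# precision 0.8, recall ≈ 0.889, accuracy 0.8125, specificity ≈ 0.714
```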

### Structured Extraction Metrics

For scalar and list fields (e.g., "Diagnosis", "Treatment Drugs"):

#### Core Counts (Per Case Analysis)
| Count | Definition | Example |
|-------|------------|---------|
| **Correct (Cor)** | Items extracted correctly | Label: `["DrugA", "DrugB"]`, Prediction: `["DrugA"]` → Cor=1 |
| **Missing (Mis)** | Items present in label but not extracted | (Same example) → Mis=1 (DrugB missing) |
| **Spurious (Spu)** | Items extracted but not in label | Label: `["DrugA"]`, Prediction: `["DrugA", "DrugC"]` → Spu=1 |
| **Incorrect (Inc)** | Wrong values for scalar fields | Label: `"Cancer"`, Prediction: `"Diabetes"` → Inc=1 |

#### Structured Extraction Formulas

| Metric | Formula | Meaning |
|--------|---------|---------|
| **Precision** | `Cor / (Cor + Spu + Inc)` | Of all extracted items, how many were correct? |
| **Recall** | `Cor / (Cor + Mis + Inc)` | Of all labeled items, how many were correctly extracted? |

**Note:** For scalar fields, Inc (incorrect) is used; for list fields, Inc is typically 0 since items are either correct, missing, or spurious.
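
For list fields, the per-case counts above amount to a set comparison between label and prediction. A minimal sketch (not the framework's own comparison logic, which may also normalize item text):

```python
def list_counts(label, prediction):
    """Count correct, missing, and spurious items for one list field."""
    label_set, pred_set = set(label), set(prediction)
    cor = len(label_set & pred_set)  # in both label and prediction
    mis = len(label_set - pred_set)  # labeled but not extracted
    spu = len(pred_set - label_set)  # extracted but not labeled
    return cor, mis, spu

print(list_counts(["DrugA", "DrugB"], ["DrugA", "DrugC"]))  # (1, 1, 1)
```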

The following formulas apply to both binary classification and structured extraction metrics:

| Metric | Formula | Meaning |
|--------|---------|---------|
| **F1 Score** | `2 × (P × R) / (P + R)` | Balanced harmonic mean of precision and recall |
| **F2 Score** | `5 × (P × R) / (4P + R)` | Recall-weighted F-score (emphasizes recall over precision) |

Where P = Precision and R = Recall (calculated differently for each metric type).
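
Both are instances of the general F-beta score, `(1 + β²) × P × R / (β² × P + R)`, with β=1 and β=2 respectively; a small sketch:

```python
def f_beta(precision, recall, beta=1.0):
    """General F-beta score; beta=1 gives F1, beta=2 gives the recall-weighted F2."""
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.8, 0.5
print(round(f_beta(p, r, beta=1), 4))  # 0.6154 (F1)
print(round(f_beta(p, r, beta=2), 4))  # 0.5405 (F2 sits closer to recall)
```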

## Bootstrap Confidence Intervals

The framework includes statistical confidence interval estimation using non-parametric bootstrap resampling at the case level. This provides uncertainty quantification for all validation metrics.

### Usage
```python
from src.validation import bootstrap_CI

# After running validation to get results_df
ci_results = bootstrap_CI(
    res_df=results_df,                  # Results from validate() function
    fields=["diagnosis", "treatment"],  # Fields to analyze (or None for auto-detect)
    n_bootstrap=5000,                   # Number of bootstrap samples (default: 5000)
    ci=0.95,                            # Confidence level (default: 0.95 for 95% CI)
    random_state=42                     # For reproducible results
)
```

### Bootstrap Method
- **Resampling unit**: Individual cases (not individual predictions)
- **Resampling strategy**: Sample with replacement to preserve original dataset size
- **CI calculation**: Percentile method using bootstrap distribution
- **Partial labeling**: Handles missing labels gracefully - cases with missing labels for specific fields are excluded from calculations for those fields only
- **Metrics included**: All validation metrics (precision, recall, F1, accuracy, etc.)
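
The percentile method described above can be illustrated in a few lines (a generic sketch of case-level bootstrap, not the framework's implementation):

```python
import random

def bootstrap_ci(values, metric, n_bootstrap=5000, ci=0.95, seed=42):
    """Percentile bootstrap CI: resample cases with replacement, recompute the metric."""
    rng = random.Random(seed)
    stats = sorted(
        metric([rng.choice(values) for _ in range(len(values))])
        for _ in range(n_bootstrap)
    )
    lo = stats[int((1 - ci) / 2 * n_bootstrap)]
    hi = stats[int((1 + ci) / 2 * n_bootstrap) - 1]
    return sum(stats) / len(stats), lo, hi

# Per-case correctness flags for one field (1 = correct, 0 = wrong)
cases = [1] * 82 + [0] * 18
mean, lo, hi = bootstrap_ci(cases, metric=lambda s: sum(s) / len(s))
print(f"accuracy ≈ {mean:.2f}, 95% CI ≈ [{lo:.2f}, {hi:.2f}]")
```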

### Output Format
The `bootstrap_CI()` function returns a DataFrame with confidence intervals for each field:

| Column | Description |
|--------|-------------|
| `field` | Field name (including 'exceptions' for system metrics and 'N={n}; CI={level}%' for parameters) |
| `labeled cases` | Number of labeled cases in the dataset |
| `{metric}: mean` | Bootstrap mean estimate |
| `{metric}: lower` | Lower bound of confidence interval |
| `{metric}: upper` | Upper bound of confidence interval |

Example output:
```
   field           labeled cases  precision (micro): mean  precision (micro): lower  precision (micro): upper
0  exceptions      1000           NaN                      NaN                       NaN
1  diagnosis       1000           0.82                     0.79                      0.85
2  treatment       1000           0.91                     0.88                      0.94
3  N=5000; CI=95%  NaN            NaN                      NaN                       NaN
```

The final row records the bootstrap parameters for reference: the number of bootstrap samples (N) and the confidence interval level (CI).

### Use Cases
- **Performance assessment**: Quantify uncertainty in reported metrics
- **Model comparison**: Determine if performance differences are statistically significant
- **Sample size planning**: Understand precision of estimates with current dataset size
- **Publication**: Report confidence intervals alongside point estimates

## 🛠️ Advanced Configuration

### Parallel Processing
```python
validate(
    source_df=df,
    fields=["diagnosis", "treatment"],
    structure_callback=callback,
    max_workers=None,  # Auto-detect CPU count (or specify number)
    use_threads=True   # True for I/O-bound (LLM API calls), False for CPU-bound
)
```

### Performance Features
- **Automatic caching** - Identical raw text inputs are deduplicated and cached
- **Progress tracking** - Real-time progress bar for long-running validations
- **Cache statistics** - Check `Sys: from cache` column in results to monitor cache hits
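
The caching behavior amounts to memoizing the callback on the raw text, so duplicate documents are processed once (an illustrative sketch only; the framework handles this internally):

```python
cache = {}

def cached_call(raw_text, structure_callback):
    """Run the callback once per distinct raw text; reuse the result otherwise."""
    if raw_text in cache:
        return cache[raw_text], True     # served from cache
    result = structure_callback(raw_text)
    cache[raw_text] = result
    return result, False                 # freshly computed

_, hit1 = cached_call("report A", str.upper)
_, hit2 = cached_call("report A", str.upper)
print(hit1, hit2)  # False True
```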

### Confidence Analysis
When LLM inference returns extracted fields together with their associated confidence levels, the framework automatically detects `Res: {Field} confidence` columns and generates:
- Separate metrics for each unique confidence level found in your data
- Overall metrics aggregating across all confidence levels

This breakdown is useful for setting confidence thresholds and analyzing prediction reliability.
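
For instance, rows of the metrics output can be filtered on the `confidence` column to inspect a single level (a toy table in the shape described above; real output comes from `validate()`):

```python
import pandas as pd

# Hypothetical metrics rows mimicking the documented column names
metrics_df = pd.DataFrame({
    "field": ["Diagnosis", "Diagnosis", "Diagnosis"],
    "confidence": ["Overall", "High", "Low"],
    "precision (micro)": [0.82, 0.95, 0.60],
})

high = metrics_df[metrics_df["confidence"] == "High"]
print(high["precision (micro)"].iloc[0])  # 0.95
```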

## 🧪 Development & Testing

```bash
# Install development dependencies
pip install -r requirements.txt

# Run all tests
pytest

# Run with coverage reporting
pytest --cov=src

# Run specific test modules
pytest tests/validate_test.py             # Core validation logic
pytest tests/compare_results_test.py      # Comparison algorithms
pytest tests/compare_results_all_test.py  # End-to-end comparisons
```

## 📁 Project Structure

```
llm-validation-framework/
├── src/
│   ├── validation.py      # Main validation pipeline and metrics calculation
│   ├── structured.py      # Pydantic data models for LLM results
│   ├── utils.py           # Utility functions (list conversion, flattening)
│   └── standardize.py     # Data standardization helpers
├── tests/                 # Comprehensive test suite
├── validation_results/    # Output directory (auto-created)
├── samples.csv            # Demo dataset with all validation scenarios
├── runme.py               # Demo script
└── requirements.txt       # Dependencies (pandas, pydantic, tqdm, etc.)
```

## 🔧 Troubleshooting

| Error | Solution |
|-------|----------|
| **"Cannot infer fields"** | Ensure DataFrame has both `{Field}` and `Res: {Field}` columns when `structure_callback=None` |
| **"Missing fields"** | Verify `fields` parameter contains column names that exist in your DataFrame |
| **"Duplicate index"** | Use `df.reset_index(drop=True)` or ensure your DataFrame index has unique values |
| **Import/dependency errors** | Run `pip install -r requirements.txt` and verify Python 3.11+ |
| **Slow performance** | Enable parallel processing with `max_workers=None` and `use_threads=True` for LLM API calls |

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.