llmvalidate-0.3.0.tar.gz

MIT License

Copyright (c) 2026 Oncoshot

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Metadata-Version: 2.4
Name: llmvalidate
Version: 0.3.0
Summary: Oncoshot LLM validation framework
License: MIT
Project-URL: Homepage, https://github.com/Oncoshot/llm-validation-framework
Project-URL: Repository, https://github.com/Oncoshot/llm-validation-framework
Project-URL: Bug Tracker, https://github.com/Oncoshot/llm-validation-framework/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# LLM Validation Framework

A comprehensive Python framework for evaluating LLM-extracted structured data against ground truth labels. Supports binary classification, scalar values, and list fields with detailed performance metrics, confidence-based evaluation, and statistical uncertainty quantification via non-parametric bootstrap confidence intervals.

## ✨ Key Features

- **Multi-field validation** - Binary (True/False), scalar (single values), and list (multiple values) data types
- **Partial labeling support** - Handle datasets where different cases have labels for different subsets of fields
- **Dual usage modes** - Validate pre-computed results OR run live LLM inference with validation
- **Comprehensive metrics** - Precision, recall, F1/F2, accuracy, specificity with both micro and macro aggregation
- **Confidence analysis** - Automatic performance breakdown by confidence levels
- **Statistical uncertainty** - Non-parametric bootstrap confidence intervals for all performance metrics
- **Production ready** - Parallel processing, intelligent caching, detailed progress tracking

## 🚀 Quick Start

### Prerequisites
```bash
# Install from PyPI
pip install llmvalidate

# OR install from source
pip install -r requirements.txt  # Python 3.11+ required
```

### Demo
```bash
python runme.py
```

Processes the included [samples.csv](samples.csv) (14 test cases covering all validation scenarios) and outputs timestamped results to `validation_results/samples/`:

- **[Results CSV](validation_results/samples/2026-02-23%2012-42-40%20results.csv)** - Row-by-row comparison with confusion matrix counts and item-level details
- **[Metrics CSV](validation_results/samples/2026-02-23%2012-42-40%20metrics.csv)** - Aggregated performance statistics with confidence breakdowns
- **[CI Metrics CSV](validation_results/samples/2026-02-23%2012-42-40%20CI%20metrics.csv)** - Confidence intervals for metrics

| Rows | Field Type | Test Scenarios |
|------|------------|----------------|
| **1-4** | Binary (`Has metastasis`) | True Positive, True Negative, False Positive, False Negative |
| **5-9** | Scalar (`Diagnosis`, `Histology`) | Correct, incorrect, missing, spurious, and empty extractions |
| **10-14** | List (`Treatment Drugs`, `Test Results`) | Perfect match, spurious items, missing items, correct empty, mixed results |

## 📊 Usage Modes

### Mode 1: Validate Existing Results
When you have LLM predictions in `Res: {Field Name}` columns:

```python
import pandas as pd
from src.validation import validate

df = pd.read_csv("data.csv", index_col="Patient ID")
# df must contain: "Field Name" and "Res: Field Name" columns

results_df, metrics_df = validate(
    source_df=df,
    fields=["Diagnosis", "Treatment"],  # or None for auto-detection
    structure_callback=None,
    output_folder="validation_results"
)
```

### Mode 2: Live LLM Inference + Validation

```python
from src.structured import StructuredResult, StructuredGroup, StructuredField
from src.utils import flatten_structured_result
from src.validation import validate

def llm_callback(row, i, raw_text_column_name):
    raw_text = row[raw_text_column_name]
    # Your LLM inference logic here
    result = StructuredResult(
        groups=[StructuredGroup(
            group_name="medical",
            fields=[
                StructuredField(name="Diagnosis", value="Cancer", confidence="High"),
                StructuredField(name="Treatment", value=["Drug A"], confidence="Medium")
            ]
        )]
    )
    return flatten_structured_result(result), {}

results_df, metrics_df = validate(
    source_df=df,
    fields=["Diagnosis", "Treatment"],
    structure_callback=llm_callback,
    raw_text_column_name="medical_report",
    output_folder="validation_results",
    max_workers=4
)
```

## 📋 Input Data Requirements

### DataFrame Format
- **Unique index** - Each row must have a unique identifier (e.g., "Patient ID")
- **Label columns** - Ground truth values for each field you want to validate
- **Result columns** (Mode 1 only) - LLM predictions as `Res: {Field Name}` columns
- **Raw text column** (Mode 2 only) - Source text for LLM inference (e.g., "medical_report")

### Supported Field Types

| Type | Description | Label Examples | Result Examples |
|------|-------------|----------------|-----------------|
| **Binary** | True/False detection | `True`, `False` | `True`, `False` |
| **Scalar** | Single text/numeric value | `"Lung Cancer"` <br> `42` | `"Breast Cancer"` <br> `38` |
| **List** | Multiple values | `["Drug A", "Drug B"]` <br> `"['Item1', 'Item2']"` | `["Drug A"]` <br> `[]` |

### Special Value Handling
- **`"-"`** = Labeled as "No information is available in the source document"
- **`null/empty/NaN`** = Field not labeled/evaluated (supports partial labeling where different cases may have labels for different field subsets)
- **Lists** - Can be Python lists `["a", "b"]` or stringified `"['a', 'b']"` (auto-converted)

### Partial Labeling Support
The framework supports partial labeling scenarios where:
- Not every case needs labels for every field
- Different cases can have labels for different subsets of fields
- Missing labels (`null`/`NaN`) are handled gracefully in all metrics calculations
- Use `"-"` when the document explicitly lacks information about a field
- Use `null`/`NaN` when the field simply wasn't labeled for that case

## 📈 Output Files

The framework generates two timestamped CSV files for each validation run (the demo's third file, CI metrics, comes from the separate `bootstrap_CI()` step):

### 1. Results CSV (`YYYY-MM-DD HH-MM-SS results.csv`)
**Row-level analysis** with detailed per-case metrics:

**Original Data:**
- All input columns (labels, raw text, etc.)
- `Res: {Field}` columns with LLM predictions
- `Res: {Field} confidence` and `Res: {Field} justification` (if available)

**Binary Fields:**
- `TP/FP/FN/TN: {Field}` - Confusion matrix counts (1 or 0 per row)

**Non-Binary Fields:**
- `Cor/Inc/Mis/Spu: {Field}` - Item counts per row
- `Cor/Inc/Mis/Spu: {Field} items` - Actual item lists
- `Precision/Recall/F1/F2: {Field}` - Per-row metrics (list fields only)

**System Columns:**
- `Sys: from cache` - Whether the result was cached (speeds up duplicate text)
- `Sys: exception` - Error information if processing failed
- `Sys: time taken` - Processing time per row in seconds

### 2. Metrics CSV (`YYYY-MM-DD HH-MM-SS metrics.csv`)
**Aggregated statistics** with confidence breakdowns:

**Core Information:**
- `field` - Field name being evaluated
- `confidence` - Confidence level ("Overall", "High", "Medium", "Low", etc.)
- `labeled cases` - Total rows with ground truth labels
- `field-present cases` - Rows where the document has information about the field (label is not `'-'`)

**Binary Metrics:** `TP`, `TN`, `FP`, `FN`, `precision`, `recall`, `F1/F2`, `accuracy`, `specificity`

**Non-Binary Metrics:** `cor`, `inc`, `mis`, `spu`, `precision/recall/F1/F2 (micro)`, `precision/recall/F1/F2 (macro)`

## ⚡ Performance Metrics Explained

### Binary Classification Metrics

For fields with True/False values (e.g., "Has metastasis"):

#### Confusion Matrix Counts
| Count | Definition | Example |
|-------|------------|---------|
| **TP (True Positive)** | Correctly predicted positive | Label: `True`, Prediction: `True` → TP=1 |
| **TN (True Negative)** | Correctly predicted negative | Label: `False`, Prediction: `False` → TN=1 |
| **FP (False Positive)** | Incorrectly predicted positive | Label: `False`, Prediction: `True` → FP=1 |
| **FN (False Negative)** | Incorrectly predicted negative | Label: `True`, Prediction: `False` → FN=1 |

#### Binary Classification Formulas
| Metric | Formula | Meaning |
|--------|---------|---------|
| **Precision** | `TP / (TP + FP)` | Of all positive predictions, how many were correct? |
| **Recall** | `TP / (TP + FN)` | Of all actual positives, how many were found? |
| **Accuracy** | `(TP + TN) / (TP + TN + FP + FN)` | Overall percentage of correct predictions |
| **Specificity** | `TN / (TN + FP)` | Of all actual negatives, how many were correctly identified? |
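
As a quick sanity check, the binary formulas above can be computed directly from the four counts (a standalone sketch, independent of the framework):

```python
def binary_metrics(TP, TN, FP, FN):
    """Compute binary classification metrics from confusion-matrix counts."""
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    specificity = TN / (TN + FP) if TN + FP else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "specificity": specificity}

print(binary_metrics(TP=8, TN=5, FP=2, FN=1))
# precision 0.8, recall ≈ 0.889, accuracy 0.8125, specificity ≈ 0.714
```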

### Structured Extraction Metrics

For scalar and list fields (e.g., "Diagnosis", "Treatment Drugs"):

#### Core Counts (Per Case Analysis)
| Count | Definition | Example |
|-------|------------|---------|
| **Correct (Cor)** | Items extracted correctly | Label: `["DrugA", "DrugB"]`, Prediction: `["DrugA"]` → Cor=1 |
| **Missing (Mis)** | Items present in label but not extracted | (Same example) → Mis=1 (DrugB missing) |
| **Spurious (Spu)** | Items extracted but not in label | Label: `["DrugA"]`, Prediction: `["DrugA", "DrugC"]` → Spu=1 |
| **Incorrect (Inc)** | Wrong values for scalar fields | Label: `"Cancer"`, Prediction: `"Diabetes"` → Inc=1 |

#### Structured Extraction Formulas

| Metric | Formula | Meaning |
|--------|---------|---------|
| **Precision** | `Cor / (Cor + Spu + Inc)` | Of all extracted items, how many were correct? |
| **Recall** | `Cor / (Cor + Mis + Inc)` | Of all labeled items, how many were correctly extracted? |

**Note:** For scalar fields, Inc (incorrect) is used; for list fields, Inc is typically 0 since items are either correct, missing, or spurious.
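
For list fields, the per-case counts above amount to a set comparison between label and prediction. A minimal sketch (not the framework's own comparison logic, which may also normalize item text):

```python
def list_counts(label, prediction):
    """Count correct, missing, and spurious items for one list field."""
    label_set, pred_set = set(label), set(prediction)
    cor = len(label_set & pred_set)  # in both label and prediction
    mis = len(label_set - pred_set)  # labeled but not extracted
    spu = len(pred_set - label_set)  # extracted but not labeled
    return cor, mis, spu

print(list_counts(["DrugA", "DrugB"], ["DrugA", "DrugC"]))  # (1, 1, 1)
```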

The following formulas apply to both binary classification and structured extraction metrics:

| Metric | Formula | Meaning |
|--------|---------|---------|
| **F1 Score** | `2 × (P × R) / (P + R)` | Balanced harmonic mean of precision and recall |
| **F2 Score** | `5 × (P × R) / (4P + R)` | Recall-weighted F-score (emphasizes recall over precision) |

Where P = Precision and R = Recall (calculated differently for each metric type).
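
Both are instances of the general F-beta score, `(1 + β²) × P × R / (β² × P + R)`, with β=1 and β=2 respectively; a small sketch:

```python
def f_beta(precision, recall, beta=1.0):
    """General F-beta score; beta=1 gives F1, beta=2 gives the recall-weighted F2."""
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.8, 0.5
print(round(f_beta(p, r, beta=1), 4))  # 0.6154 (F1)
print(round(f_beta(p, r, beta=2), 4))  # 0.5405 (F2 sits closer to recall)
```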

## Bootstrap Confidence Intervals

The framework includes statistical confidence interval estimation using non-parametric bootstrap resampling at the case level. This provides uncertainty quantification for all validation metrics.

### Usage
```python
from src.validation import bootstrap_CI

# After running validation to get results_df
ci_results = bootstrap_CI(
    res_df=results_df,                  # Results from validate() function
    fields=["diagnosis", "treatment"],  # Fields to analyze (or None for auto-detect)
    n_bootstrap=5000,                   # Number of bootstrap samples (default: 5000)
    ci=0.95,                            # Confidence level (default: 0.95 for 95% CI)
    random_state=42                     # For reproducible results
)
```

### Bootstrap Method
- **Resampling unit**: Individual cases (not individual predictions)
- **Resampling strategy**: Sample with replacement to preserve original dataset size
- **CI calculation**: Percentile method using bootstrap distribution
- **Partial labeling**: Handles missing labels gracefully - cases with missing labels for specific fields are excluded from calculations for those fields only
- **Metrics included**: All validation metrics (precision, recall, F1, accuracy, etc.)
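
The percentile method described above can be illustrated in a few lines (a generic sketch of case-level bootstrap, not the framework's implementation):

```python
import random

def bootstrap_ci(values, metric, n_bootstrap=5000, ci=0.95, seed=42):
    """Percentile bootstrap CI: resample cases with replacement, recompute the metric."""
    rng = random.Random(seed)
    stats = sorted(
        metric([rng.choice(values) for _ in range(len(values))])
        for _ in range(n_bootstrap)
    )
    lo = stats[int((1 - ci) / 2 * n_bootstrap)]
    hi = stats[int((1 + ci) / 2 * n_bootstrap) - 1]
    return sum(stats) / len(stats), lo, hi

# Per-case correctness flags for one field (1 = correct, 0 = wrong)
cases = [1] * 82 + [0] * 18
mean, lo, hi = bootstrap_ci(cases, metric=lambda s: sum(s) / len(s))
print(f"accuracy ≈ {mean:.2f}, 95% CI ≈ [{lo:.2f}, {hi:.2f}]")
```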

### Output Format
The `bootstrap_CI()` function returns a DataFrame with confidence intervals for each field:

| Column | Description |
|--------|-------------|
| `field` | Field name (including 'exceptions' for system metrics and 'N={n}; CI={level}%' for parameters) |
| `labeled cases` | Number of labeled cases in the dataset |
| `{metric}: mean` | Bootstrap mean estimate |
| `{metric}: lower` | Lower bound of confidence interval |
| `{metric}: upper` | Upper bound of confidence interval |

Example output:
```
   field           labeled cases  precision (micro): mean  precision (micro): lower  precision (micro): upper
0  exceptions      1000           NaN                      NaN                       NaN
1  diagnosis       1000           0.82                     0.79                      0.85
2  treatment       1000           0.91                     0.88                      0.94
3  N=5000; CI=95%  NaN            NaN                      NaN                       NaN
```

The final row records the bootstrap parameters for reference: the number of bootstrap samples (N) and the confidence interval level (CI).

### Use Cases
- **Performance assessment**: Quantify uncertainty in reported metrics
- **Model comparison**: Determine if performance differences are statistically significant
- **Sample size planning**: Understand precision of estimates with current dataset size
- **Publication**: Report confidence intervals alongside point estimates

## 🛠️ Advanced Configuration

### Parallel Processing
```python
validate(
    source_df=df,
    fields=["diagnosis", "treatment"],
    structure_callback=callback,
    max_workers=None,  # Auto-detect CPU count (or specify number)
    use_threads=True   # True for I/O-bound (LLM API calls), False for CPU-bound
)
```

### Performance Features
- **Automatic caching** - Identical raw text inputs are deduplicated and cached
- **Progress tracking** - Real-time progress bar for long-running validations
- **Cache statistics** - Check `Sys: from cache` column in results to monitor cache hits
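
The caching behavior amounts to memoizing the callback on the raw text, so duplicate documents are processed once (an illustrative sketch only; the framework handles this internally):

```python
cache = {}

def cached_call(raw_text, structure_callback):
    """Run the callback once per distinct raw text; reuse the result otherwise."""
    if raw_text in cache:
        return cache[raw_text], True     # served from cache
    result = structure_callback(raw_text)
    cache[raw_text] = result
    return result, False                 # freshly computed

_, hit1 = cached_call("report A", str.upper)
_, hit2 = cached_call("report A", str.upper)
print(hit1, hit2)  # False True
```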

### Confidence Analysis
When LLM inference returns extracted fields together with their associated confidence levels, the framework automatically detects `Res: {Field} confidence` columns and generates:
- Separate metrics for each unique confidence level found in your data
- Overall metrics aggregating across all confidence levels

This breakdown is useful for setting confidence thresholds and analyzing prediction reliability.
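
For instance, rows of the metrics output can be filtered on the `confidence` column to inspect a single level (a toy table in the shape described above; real output comes from `validate()`):

```python
import pandas as pd

# Hypothetical metrics rows mimicking the documented column names
metrics_df = pd.DataFrame({
    "field": ["Diagnosis", "Diagnosis", "Diagnosis"],
    "confidence": ["Overall", "High", "Low"],
    "precision (micro)": [0.82, 0.95, 0.60],
})

high = metrics_df[metrics_df["confidence"] == "High"]
print(high["precision (micro)"].iloc[0])  # 0.95
```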

## 🧪 Development & Testing

```bash
# Install development dependencies
pip install -r requirements.txt

# Run all tests
pytest

# Run with coverage reporting
pytest --cov=src

# Run specific test modules
pytest tests/validate_test.py             # Core validation logic
pytest tests/compare_results_test.py      # Comparison algorithms
pytest tests/compare_results_all_test.py  # End-to-end comparisons
```

## 📁 Project Structure

```
llm-validation-framework/
├── src/
│   ├── validation.py      # Main validation pipeline and metrics calculation
│   ├── structured.py      # Pydantic data models for LLM results
│   ├── utils.py           # Utility functions (list conversion, flattening)
│   └── standardize.py     # Data standardization helpers
├── tests/                 # Comprehensive test suite
├── validation_results/    # Output directory (auto-created)
├── samples.csv            # Demo dataset with all validation scenarios
├── runme.py               # Demo script
└── requirements.txt       # Dependencies (pandas, pydantic, tqdm, etc.)
```

## 🔧 Troubleshooting

| Error | Solution |
|-------|----------|
| **"Cannot infer fields"** | Ensure DataFrame has both `{Field}` and `Res: {Field}` columns when `structure_callback=None` |
| **"Missing fields"** | Verify `fields` parameter contains column names that exist in your DataFrame |
| **"Duplicate index"** | Use `df.reset_index(drop=True)` or ensure your DataFrame index has unique values |
| **Import/dependency errors** | Run `pip install -r requirements.txt` and verify Python 3.11+ |
| **Slow performance** | Enable parallel processing with `max_workers=None` and `use_threads=True` for LLM API calls |

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.