validity-screen-0.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Jon-Paul Cacioli

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,157 @@
Metadata-Version: 2.4
Name: validity-screen
Version: 0.1.0
Summary: Validity screening protocol for LLM confidence signals
Author: Jon-Paul Cacioli
License: MIT
Project-URL: Homepage, https://github.com/synthiumjp/validity-scaling-llm
Project-URL: Documentation, https://github.com/synthiumjp/validity-scaling-llm/tree/master/screen
Project-URL: Repository, https://github.com/synthiumjp/validity-scaling-llm
Project-URL: Issues, https://github.com/synthiumjp/validity-scaling-llm/issues
Keywords: llm,confidence,validity,metacognition,calibration,selective-prediction,screening,psychometrics,evaluation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20
Requires-Dist: scipy>=1.7
Dynamic: license-file

# validity-screen

**Check whether an LLM's confidence signal carries information before you build on it.**

[![PyPI](https://img.shields.io/pypi/v/validity-screen)](https://pypi.org/project/validity-screen/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Implements the screening protocol from:

> Cacioli, J. P. (2026). *Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals.* arXiv.

## Install

```bash
pip install validity-screen
```

## Quick start (Python)

```python
import numpy as np
from validity_screen import screen

# Your data: item-level correctness and confidence
correct = np.array([True, True, False, True, False, True, True, False])
confidence = np.array([True, True, True, True, False, True, False, False])

result = screen(correct, confidence, model_name="My Model")

print(result.tier)        # 'Valid', 'Indeterminate', or 'Invalid'
print(result.vrs_table()) # Complete reporting table
```

## Quick start (command line)

```bash
# From a CSV with 'correct' and 'confidence' columns
validity-screen run --data my_data.csv --model-name "GPT-5.4"

# From separate files
validity-screen run --correct correct.txt --confidence confidence.txt

# Continuous confidence? Binarised at --threshold (defaults to the median)
validity-screen run --data my_data.csv --confidence-col prob --threshold 0.5

# JSON output for pipelines
validity-screen run --data my_data.csv --json
```

## What it does

Before computing calibration metrics (ECE), metacognitive sensitivity (meta-d', AUROC), or selective prediction accuracy, this protocol checks whether the confidence signal carries item-level information about correctness. If it doesn't, those downstream metrics are fitting noise.

Five values from a 2x2 contingency table. Three possible outcomes.

| Tier | Meaning | Action |
|------|---------|--------|
| **Valid** | Confidence tracks correctness | Proceed with downstream metrics |
| **Indeterminate** | Near threshold, uncertain | Compute but flag; consider more items |
| **Invalid** | Confidence does not discriminate | Do not interpret AUROC, ECE, selective prediction |

## Indices

| Index | What it detects | Invalid threshold |
|-------|-----------------|-------------------|
| **L** | Blanket confidence on errors | >= 0.95 |
| **Fp** | Over-withdrawal of correct items | >= 0.50 |
| **RBS** | Inverted monitoring direction | > 0 (CI excludes zero) |
| **TRIN** | Fixed responding | >= 0.95 (warning only) |
| **r** | Item-level sensitivity | Reported, not thresholded |

## Batch screening

```python
from validity_screen import screen_batch, summary_table

models = {
    "GPT-5.4": {"correct": correct_gpt, "confidence": conf_gpt},
    "Claude": {"correct": correct_claude, "confidence": conf_claude},
    "Gemini": {"correct": correct_gemini, "confidence": conf_gemini},
}

results = screen_batch(models, benchmark_name="MMLU")
print(summary_table(results))
```

## Continuous confidence

```python
from validity_screen import screen, binarise

# Binarise at a fixed threshold
confidence_binary = binarise(confidence_continuous, threshold=50)

# Or at the sample median
confidence_binary = binarise(confidence_continuous, method='median')

result = screen(correct, confidence_binary)
```

## Requirements

- Python >= 3.8
- NumPy >= 1.20
- SciPy >= 1.7

## Citation

```bibtex
@article{cacioli2026screen,
  title={Screen Before You Interpret: A Portable Validity Protocol for
         Benchmark-Based LLM Confidence Signals},
  author={Cacioli, Jon-Paul},
  journal={arXiv preprint},
  year={2026}
}

@article{cacioli2026validity,
  title={Before You Interpret the Profile: Validity Scaling for
         LLM Metacognitive Self-Report},
  author={Cacioli, Jon-Paul},
  journal={arXiv preprint},
  year={2026}
}
```

## License

MIT
@@ -0,0 +1,129 @@
# validity-screen

**Check whether an LLM's confidence signal carries information before you build on it.**

[![PyPI](https://img.shields.io/pypi/v/validity-screen)](https://pypi.org/project/validity-screen/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Implements the screening protocol from:

> Cacioli, J. P. (2026). *Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals.* arXiv.

## Install

```bash
pip install validity-screen
```

## Quick start (Python)

```python
import numpy as np
from validity_screen import screen

# Your data: item-level correctness and confidence
correct = np.array([True, True, False, True, False, True, True, False])
confidence = np.array([True, True, True, True, False, True, False, False])

result = screen(correct, confidence, model_name="My Model")

print(result.tier)        # 'Valid', 'Indeterminate', or 'Invalid'
print(result.vrs_table()) # Complete reporting table
```

## Quick start (command line)

```bash
# From a CSV with 'correct' and 'confidence' columns
validity-screen run --data my_data.csv --model-name "GPT-5.4"

# From separate files
validity-screen run --correct correct.txt --confidence confidence.txt

# Continuous confidence? Binarised at --threshold (defaults to the median)
validity-screen run --data my_data.csv --confidence-col prob --threshold 0.5

# JSON output for pipelines
validity-screen run --data my_data.csv --json
```

## What it does

Before computing calibration metrics (ECE), metacognitive sensitivity (meta-d', AUROC), or selective prediction accuracy, this protocol checks whether the confidence signal carries item-level information about correctness. If it doesn't, those downstream metrics are fitting noise.

Five values from a 2x2 contingency table. Three possible outcomes.

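A confidence signal that carries no item-level information still produces non-trivial-looking downstream numbers in finite samples. A quick illustration in plain NumPy (not part of the package; the coin-flip `confidence` is a deliberately information-free worst case):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
correct = rng.random(n) < 0.7     # a 70%-accurate model
confidence = rng.random(n) < 0.5  # coin-flip "confidence": no information

# For a binary score, AUROC reduces to 0.5 + (TPR - FPR) / 2, where
# TPR = P(high conf | correct) and FPR = P(high conf | incorrect).
tpr = confidence[correct].mean()
fpr = confidence[~correct].mean()
auroc = 0.5 + (tpr - fpr) / 2
print(round(auroc, 3))  # near 0.5, but rarely exactly 0.5
```

Reading meaning into that deviation from 0.5 is exactly the failure mode the screen is designed to catch.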
| Tier | Meaning | Action |
|------|---------|--------|
| **Valid** | Confidence tracks correctness | Proceed with downstream metrics |
| **Indeterminate** | Near threshold, uncertain | Compute but flag; consider more items |
| **Invalid** | Confidence does not discriminate | Do not interpret AUROC, ECE, selective prediction |
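The Indeterminate tier exists because each index is judged against its Wilson score interval, not just its point value. A self-contained sketch of that interval (the package exports an equivalent `wilson_ci`); the cell counts below are hypothetical:

```python
import numpy as np
from scipy import stats

def wilson_ci(k: int, n: int, alpha: float = 0.05):
    """Wilson score interval for a binomial proportion k/n."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    z = stats.norm.ppf(1 - alpha / 2)
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    spread = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - spread), min(1.0, centre + spread))

# L = 19/20 = 0.95 on only 20 errors: the point value sits at the
# Invalid threshold, but the interval is far too wide to confirm it,
# so the result lands in Indeterminate rather than Invalid.
lo, hi = wilson_ci(19, 20)
print(round(lo, 3), round(hi, 3))  # roughly 0.764 0.991
```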

## Indices

| Index | What it detects | Invalid threshold |
|-------|-----------------|-------------------|
| **L** | Blanket confidence on errors | >= 0.95 |
| **Fp** | Over-withdrawal of correct items | >= 0.50 |
| **RBS** | Inverted monitoring direction | > 0 (CI excludes zero) |
| **TRIN** | Fixed responding | >= 0.95 (warning only) |
| **r** | Item-level sensitivity | Reported, not thresholded |
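All five values can be reproduced from the four cell counts with plain NumPy. A sketch using the same toy arrays as the quick start, following the definitions the package uses internally (cells a/b/c/d are correct-or-not crossed with high-or-low confidence):

```python
import numpy as np

correct = np.array([True, True, False, True, False, True, True, False])
confident = np.array([True, True, True, True, False, True, False, False])

a = int((correct & confident).sum())     # correct, high confidence
b = int((~correct & confident).sum())    # incorrect, high confidence
c = int((correct & ~confident).sum())    # correct, low confidence
d = int((~correct & ~confident).sum())   # incorrect, low confidence
n = a + b + c + d

L = b / (b + d)               # P(high confidence | incorrect)
Fp = c / (a + c)              # P(low confidence | correct)
RBS = Fp - (1 - L)            # positive = inverted monitoring direction
TRIN = max(a + b, c + d) / n  # fixed-responding rate

print(a, b, c, d)  # 4 1 1 2
print(round(L, 3), round(Fp, 3), round(RBS, 3), round(TRIN, 3))
```

`screen` additionally wraps L, Fp, and RBS in confidence intervals before making any tier decision, so point values alone never trigger Invalid.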

## Batch screening

```python
from validity_screen import screen_batch, summary_table

models = {
    "GPT-5.4": {"correct": correct_gpt, "confidence": conf_gpt},
    "Claude": {"correct": correct_claude, "confidence": conf_claude},
    "Gemini": {"correct": correct_gemini, "confidence": conf_gemini},
}

results = screen_batch(models, benchmark_name="MMLU")
print(summary_table(results))
```

## Continuous confidence

```python
from validity_screen import screen, binarise

# Binarise at a fixed threshold
confidence_binary = binarise(confidence_continuous, threshold=50)

# Or at the sample median
confidence_binary = binarise(confidence_continuous, method='median')

result = screen(correct, confidence_binary)
```

## Requirements

- Python >= 3.8
- NumPy >= 1.20
- SciPy >= 1.7

## Citation

```bibtex
@article{cacioli2026screen,
  title={Screen Before You Interpret: A Portable Validity Protocol for
         Benchmark-Based LLM Confidence Signals},
  author={Cacioli, Jon-Paul},
  journal={arXiv preprint},
  year={2026}
}

@article{cacioli2026validity,
  title={Before You Interpret the Profile: Validity Scaling for
         LLM Metacognitive Self-Report},
  author={Cacioli, Jon-Paul},
  journal={arXiv preprint},
  year={2026}
}
```

## License

MIT
1
+ [build-system]
2
+ requires = ["setuptools>=68.0", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "validity-screen"
7
+ version = "0.1.0"
8
+ description = "Validity screening protocol for LLM confidence signals"
9
+ readme = "README.md"
10
+ license = {text = "MIT"}
11
+ requires-python = ">=3.8"
12
+ authors = [
13
+ {name = "Jon-Paul Cacioli"}
14
+ ]
15
+ keywords = [
16
+ "llm", "confidence", "validity", "metacognition",
17
+ "calibration", "selective-prediction", "screening",
18
+ "psychometrics", "evaluation"
19
+ ]
20
+ classifiers = [
21
+ "Development Status :: 3 - Alpha",
22
+ "Intended Audience :: Science/Research",
23
+ "License :: OSI Approved :: MIT License",
24
+ "Programming Language :: Python :: 3",
25
+ "Programming Language :: Python :: 3.8",
26
+ "Programming Language :: Python :: 3.9",
27
+ "Programming Language :: Python :: 3.10",
28
+ "Programming Language :: Python :: 3.11",
29
+ "Programming Language :: Python :: 3.12",
30
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
31
+ ]
32
+ dependencies = [
33
+ "numpy>=1.20",
34
+ "scipy>=1.7",
35
+ ]
36
+
37
+ [project.urls]
38
+ Homepage = "https://github.com/synthiumjp/validity-scaling-llm"
39
+ Documentation = "https://github.com/synthiumjp/validity-scaling-llm/tree/master/screen"
40
+ Repository = "https://github.com/synthiumjp/validity-scaling-llm"
41
+ Issues = "https://github.com/synthiumjp/validity-scaling-llm/issues"
42
+
43
+ [project.scripts]
44
+ validity-screen = "validity_screen.cli:main"
45
+
46
+ [tool.setuptools.packages.find]
47
+ include = ["validity_screen*"]
@@ -0,0 +1,4 @@
[egg_info]
tag_build =
tag_date = 0
@@ -0,0 +1,45 @@
"""
validity-screen
===============

Validity screening protocol for LLM confidence signals.

Checks whether a model's confidence signal carries item-level
information about correctness before downstream metrics (AUROC,
ECE, meta-d', selective prediction) are computed.

Quick start::

    from validity_screen import screen

    result = screen(correct, confidence, model_name="My Model")
    print(result.tier)        # 'Valid', 'Indeterminate', or 'Invalid'
    print(result.vrs_table()) # Formatted reporting table

Reference:
    Cacioli, J. P. (2026). Screen Before You Interpret: A Portable
    Validity Protocol for Benchmark-Based LLM Confidence Signals.
"""

from validity_screen.core import (
    screen,
    screen_batch,
    summary_table,
    binarise,
    wilson_ci,
    ScreenResult,
    IndexResult,
)

__version__ = "0.1.0"
__author__ = "Jon-Paul Cacioli"

__all__ = [
    "screen",
    "screen_batch",
    "summary_table",
    "binarise",
    "wilson_ci",
    "ScreenResult",
    "IndexResult",
]
@@ -0,0 +1,162 @@
"""
Command-line interface for validity-screen.

Usage:
    validity-screen run --correct correct.txt --confidence confidence.txt
    validity-screen run --data combined.csv --correct-col is_correct --confidence-col keep
    validity-screen version
"""

import argparse
import sys

import numpy as np


def main():
    parser = argparse.ArgumentParser(
        prog="validity-screen",
        description="Validity screening protocol for LLM confidence signals.",
    )
    subparsers = parser.add_subparsers(dest="command")

    # 'run' subcommand
    run_parser = subparsers.add_parser("run", help="Run the validity screen on data.")
    run_parser.add_argument(
        "--correct", type=str, default=None,
        help="Path to a file with correctness labels (one per line, True/False or 1/0)."
    )
    run_parser.add_argument(
        "--confidence", type=str, default=None,
        help="Path to a file with confidence labels (one per line, True/False or 1/0)."
    )
    run_parser.add_argument(
        "--data", type=str, default=None,
        help="Path to a CSV file with both correctness and confidence columns."
    )
    run_parser.add_argument(
        "--correct-col", type=str, default="correct",
        help="Column name for correctness in the CSV (default: 'correct')."
    )
    run_parser.add_argument(
        "--confidence-col", type=str, default="confidence",
        help="Column name for confidence in the CSV (default: 'confidence')."
    )
    run_parser.add_argument(
        "--model-name", type=str, default="",
        help="Model name for the VRS Table."
    )
    run_parser.add_argument(
        "--benchmark", type=str, default="",
        help="Benchmark name for the VRS Table."
    )
    run_parser.add_argument(
        "--threshold", type=float, default=None,
        help="Binarisation threshold for continuous confidence (default: median)."
    )
    run_parser.add_argument(
        "--json", action="store_true",
        help="Output results as JSON instead of a VRS Table."
    )

    # 'version' subcommand
    subparsers.add_parser("version", help="Print version and exit.")

    args = parser.parse_args()

    if args.command == "version":
        from validity_screen import __version__
        print(f"validity-screen {__version__}")
        return

    if args.command != "run":
        parser.print_help()
        return

    # Load data
    from validity_screen import screen, binarise
    import json

    try:
        if args.data:
            # CSV mode: one file with both columns
            import csv
            correct_vals = []
            confidence_vals = []
            with open(args.data, "r") as f:
                reader = csv.DictReader(f)
                for row in reader:
                    c = row[args.correct_col].strip().lower()
                    correct_vals.append(c in ("true", "1", "1.0", "yes"))
                    conf = row[args.confidence_col].strip().lower()
                    # Try numeric first, then boolean
                    try:
                        confidence_vals.append(float(conf))
                    except ValueError:
                        confidence_vals.append(conf in ("true", "1", "1.0", "yes", "keep", "bet"))

            correct = np.array(correct_vals, dtype=bool)
            confidence_raw = np.array(confidence_vals)

            # Binarise if continuous
            if confidence_raw.dtype == float and not np.all((confidence_raw == 0) | (confidence_raw == 1)):
                threshold = args.threshold if args.threshold is not None else float(np.median(confidence_raw))
                confidence = binarise(confidence_raw, threshold)
                binarisation = f"threshold={threshold:.3f}"
            else:
                confidence = confidence_raw.astype(bool)
                binarisation = "N/A (already binary)"

        elif args.correct and args.confidence:
            # Two-file mode
            def load_column(path):
                vals = []
                with open(path, "r") as f:
                    for line in f:
                        line = line.strip().lower()
                        if not line:
                            continue
                        try:
                            vals.append(float(line))
                        except ValueError:
                            vals.append(line in ("true", "1", "yes"))
                return np.array(vals)

            correct = load_column(args.correct).astype(bool)
            confidence_raw = load_column(args.confidence)

            if confidence_raw.dtype == float and not np.all((confidence_raw == 0) | (confidence_raw == 1)):
                threshold = args.threshold if args.threshold is not None else float(np.median(confidence_raw))
                confidence = binarise(confidence_raw, threshold)
                binarisation = f"threshold={threshold:.3f}"
            else:
                confidence = confidence_raw.astype(bool)
                binarisation = "N/A (already binary)"
        else:
            print("Error: provide either --data or both --correct and --confidence.", file=sys.stderr)
            sys.exit(1)

        # Run screen
        result = screen(
            correct, confidence,
            model_name=args.model_name,
            benchmark_name=args.benchmark,
            binarisation_threshold=binarisation,
        )

        if args.json:
            print(json.dumps(result.to_dict(), indent=2))
        else:
            print(result.vrs_table())

    except FileNotFoundError as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)
    except KeyError as e:
        print(f"Error: column {e} not found in CSV.", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()
"""
validity_screen.core
====================

Validity screening protocol for LLM confidence signals.

Implements the Stage A screening sequence from:
    Cacioli, J. P. (2026). Screen Before You Interpret: A Portable Validity
    Protocol for Benchmark-Based LLM Confidence Signals. arXiv.

Usage:
    from validity_screen import screen

    result = screen(correct, confidence)
    print(result.tier)        # 'Valid', 'Indeterminate', or 'Invalid'
    print(result.vrs_table()) # Formatted VRS Table
"""

__version__ = "0.1.0"
__author__ = "Jon-Paul Cacioli"

import numpy as np
from dataclasses import dataclass, field
from typing import Optional, List, Tuple
from scipy import stats as sp_stats


# ============================================================
# Wilson score confidence interval
# ============================================================

def wilson_ci(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Parameters
    ----------
    k : int
        Number of successes.
    n : int
        Number of trials.
    alpha : float
        Significance level (default 0.05 for 95% CI).

    Returns
    -------
    (lower, upper) : tuple of float
    """
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    z = sp_stats.norm.ppf(1 - alpha / 2)
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    spread = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - spread), min(1.0, centre + spread))
56
+
57
+
58
+ # ============================================================
59
+ # Data classes for results
60
+ # ============================================================
61
+
62
+ @dataclass
63
+ class IndexResult:
64
+ """Result for a single validity index."""
65
+ name: str
66
+ value: float
67
+ ci_lower: float
68
+ ci_upper: float
69
+ threshold: Optional[float] = None
70
+ flag: str = "ok" # 'ok', 'invalid', 'indeterminate', 'warning'
71
+ note: str = ""
72
+
73
+
74
+ @dataclass
75
+ class ScreenResult:
76
+ """Complete screening result for one model."""
77
+ # Metadata
78
+ model_name: str = ""
79
+ benchmark_name: str = ""
80
+ n_items: int = 0
81
+ n_correct: int = 0
82
+ n_incorrect: int = 0
83
+ accuracy: float = 0.0
84
+ elicitation_method: str = ""
85
+ confidence_format: str = ""
86
+ binarisation_threshold: str = "N/A"
87
+ probe_timing: str = ""
88
+
89
+ # 2x2 table
90
+ a: int = 0 # correct + high confidence
91
+ b: int = 0 # incorrect + high confidence
92
+ c: int = 0 # correct + low confidence
93
+ d: int = 0 # incorrect + low confidence
94
+
95
+ # Index results
96
+ trin: Optional[IndexResult] = None
97
+ L: Optional[IndexResult] = None
98
+ Fp: Optional[IndexResult] = None
99
+ rbs: Optional[IndexResult] = None
100
+ r_conf_correct: Optional[IndexResult] = None
101
+
102
+ # Classification
103
+ tier: str = "" # 'Valid', 'Indeterminate', 'Invalid', 'Insufficient data'
104
+ flagging_reasons: List[str] = field(default_factory=list)
105
+ response_style: str = ""
106
+
107
+ # Cell count warning
108
+ min_cell: int = 0
109
+
110
+ def vrs_table(self) -> str:
111
+ """Return a formatted VRS Table string."""
112
+ rows = [
113
+ ("Model", self.model_name or "[not specified]"),
114
+ ("Benchmark", self.benchmark_name or "[not specified]"),
115
+ ("N items", str(self.n_items)),
116
+ ("N correct / N incorrect", f"{self.n_correct} / {self.n_incorrect}"),
117
+ ("Accuracy", f"{self.accuracy:.3f}"),
118
+ ("Confidence elicitation", self.elicitation_method or "[not specified]"),
119
+ ("Confidence format", self.confidence_format or "[not specified]"),
120
+ ("Binarisation threshold", self.binarisation_threshold),
121
+ ("Probe timing", self.probe_timing or "[not specified]"),
122
+ ("2x2 table", f"a={self.a}, b={self.b}, c={self.c}, d={self.d}"),
123
+ ]
124
+
125
+ if self.trin:
126
+ direction = "fixed-high" if self.a + self.b > self.c + self.d else "fixed-low"
127
+ warn = " — structural warning" if self.trin.value >= 0.95 else ""
128
+ rows.append(("TRIN", f"{self.trin.value:.3f} ({direction}){warn}"))
129
+
130
+ if self.L:
131
+ rows.append(("L", f"{self.L.value:.3f} [{self.L.ci_lower:.3f}, {self.L.ci_upper:.3f}]"))
132
+
133
+ if self.Fp:
134
+ rows.append(("Fp", f"{self.Fp.value:.3f} [{self.Fp.ci_lower:.3f}, {self.Fp.ci_upper:.3f}]"))
135
+
136
+ if self.rbs:
137
+ rows.append(("RBS", f"{self.rbs.value:+.3f} [{self.rbs.ci_lower:+.3f}, {self.rbs.ci_upper:+.3f}]"))
138
+
139
+ if self.r_conf_correct:
140
+ r = self.r_conf_correct
141
+ sig = "p < .001" if r.ci_lower > 0 or r.ci_upper < 0 else f"p = {r.threshold:.3f}" if r.threshold else ""
142
+ rows.append(("r(confidence, correct)",
143
+ f"{r.value:+.3f}, {sig}, 95% CI [{r.ci_lower:+.3f}, {r.ci_upper:+.3f}]"))
144
+
145
+ rows.append(("Tier classification", self.tier))
146
+ rows.append(("Flagging reason", "; ".join(self.flagging_reasons) if self.flagging_reasons else "None"))
147
+ if self.response_style:
148
+ rows.append(("Response style", self.response_style))
149
+
150
+ # Format
151
+ max_label = max(len(r[0]) for r in rows)
152
+ lines = []
153
+ lines.append("=" * (max_label + 50))
154
+ lines.append("VRS TABLE — Validity Report for Confidence Screening")
155
+ lines.append("=" * (max_label + 50))
156
+ for label, value in rows:
157
+ lines.append(f" {label:<{max_label}} {value}")
158
+ lines.append("=" * (max_label + 50))
159
+ return "\n".join(lines)
160
+
161
+ def to_dict(self) -> dict:
162
+ """Return a dictionary of all fields for serialisation."""
163
+ d = {
164
+ "model_name": self.model_name,
165
+ "benchmark_name": self.benchmark_name,
166
+ "n_items": self.n_items,
167
+ "n_correct": self.n_correct,
168
+ "n_incorrect": self.n_incorrect,
169
+ "accuracy": round(self.accuracy, 4),
170
+ "cell_a": self.a, "cell_b": self.b,
171
+ "cell_c": self.c, "cell_d": self.d,
172
+ "min_cell": self.min_cell,
173
+ "tier": self.tier,
174
+ "flagging_reasons": self.flagging_reasons,
175
+ "response_style": self.response_style,
176
+ }
177
+ for idx_name in ["trin", "L", "Fp", "rbs", "r_conf_correct"]:
178
+ idx = getattr(self, idx_name)
179
+ if idx:
180
+ d[f"{idx_name}_value"] = round(idx.value, 4)
181
+ d[f"{idx_name}_ci_lower"] = round(idx.ci_lower, 4)
182
+ d[f"{idx_name}_ci_upper"] = round(idx.ci_upper, 4)
183
+ d[f"{idx_name}_flag"] = idx.flag
184
+ return d
185
+
186
+
187
+ # ============================================================
188
+ # Main screening function
189
+ # ============================================================
190
+
191
+ def screen(
192
+ correct: np.ndarray,
193
+ confidence: np.ndarray,
194
+ model_name: str = "",
195
+ benchmark_name: str = "",
196
+ elicitation_method: str = "",
197
+ confidence_format: str = "",
198
+ binarisation_threshold: str = "N/A",
199
+ probe_timing: str = "",
200
+ alpha: float = 0.05,
201
+ ) -> ScreenResult:
202
+ """Run the Stage A validity screening protocol.
203
+
204
+ Parameters
205
+ ----------
206
+ correct : array-like of bool or 0/1
207
+ Whether each item was answered correctly.
208
+ confidence : array-like of bool or 0/1
209
+ Whether the model expressed high confidence on each item.
210
+ 1 = high confidence (KEEP / BET), 0 = low confidence (WITHDRAW / NO BET).
211
+ model_name : str, optional
212
+ Model identifier for the VRS Table.
213
+ benchmark_name : str, optional
214
+ Benchmark identifier for the VRS Table.
215
+ elicitation_method : str, optional
216
+ How confidence was elicited.
217
+ confidence_format : str, optional
218
+ Original format before binarisation.
219
+ binarisation_threshold : str, optional
220
+ Threshold used if applicable.
221
+ probe_timing : str, optional
222
+ Retrospective, prospective, concurrent.
223
+ alpha : float
224
+ Significance level for confidence intervals (default 0.05).
225
+
226
+ Returns
227
+ -------
228
+ ScreenResult
229
+ Complete screening result including tier classification and VRS Table.
230
+ """
231
+ correct = np.asarray(correct, dtype=bool)
232
+ confidence = np.asarray(confidence, dtype=bool)
233
+
234
+ if len(correct) != len(confidence):
235
+ raise ValueError(f"correct ({len(correct)}) and confidence ({len(confidence)}) must have same length")
236
+
237
+ n = len(correct)
238
+ n_correct = int(correct.sum())
239
+ n_incorrect = n - n_correct
240
+
241
+ # Build 2x2 table
242
+ a = int((correct & confidence).sum()) # correct + high conf
243
+ b = int((~correct & confidence).sum()) # incorrect + high conf
244
+ c = int((correct & ~confidence).sum()) # correct + low conf
245
+ d = int((~correct & ~confidence).sum()) # incorrect + low conf
246
+
247
+ result = ScreenResult(
248
+ model_name=model_name,
249
+ benchmark_name=benchmark_name,
250
+ n_items=n,
251
+ n_correct=n_correct,
252
+ n_incorrect=n_incorrect,
253
+ accuracy=n_correct / n if n > 0 else 0.0,
254
+ elicitation_method=elicitation_method,
255
+ confidence_format=confidence_format,
256
+ binarisation_threshold=binarisation_threshold,
257
+ probe_timing=probe_timing,
258
+ a=a, b=b, c=c, d=d,
259
+ min_cell=min(a, b, c, d),
260
+ )
261
+
262
+ flags = []
263
+
264
+ # ----------------------------------------------------------
265
+ # Step 1: Check cell counts
266
+ # ----------------------------------------------------------
267
+ if min(a, b, c, d) < 5:
268
+ result.tier = "Insufficient data"
269
+ result.flagging_reasons = [f"Cell count below 5 (min cell = {min(a, b, c, d)})"]
270
+ return result
271
+
272
+ # ----------------------------------------------------------
273
+ # Step 2: TRIN (structural indicator, not a Tier 1 flag)
274
+ # ----------------------------------------------------------
275
+ n_high = a + b
276
+ n_low = c + d
277
+ trin_val = max(n_high, n_low) / n
278
+ trin_flag = "warning" if trin_val >= 0.95 else "ok"
279
+ result.trin = IndexResult(
280
+ name="TRIN", value=trin_val,
281
+ ci_lower=trin_val, ci_upper=trin_val, # deterministic
282
+ threshold=0.95, flag=trin_flag,
283
+ note="Structural warning only; does not trigger Invalid"
284
+ )
285
+
286
+ # ----------------------------------------------------------
287
+ # Step 3: Fp = P(low confidence | correct)
288
+ # ----------------------------------------------------------
289
+ fp_val = c / n_correct if n_correct > 0 else 0.0
290
+ fp_lo, fp_hi = wilson_ci(c, n_correct, alpha)
291
+ fp_flag = "ok"
292
+ if fp_val >= 0.50 and fp_lo > 0.40:
293
+ fp_flag = "invalid"
294
+ flags.append(f"Fp = {fp_val:.3f} exceeds .50 (Wilson CI lower = {fp_lo:.3f} > .40)")
295
+ elif fp_val >= 0.50:
296
+ fp_flag = "indeterminate"
297
+ flags.append(f"Fp = {fp_val:.3f} at .50 but Wilson CI lower = {fp_lo:.3f} spans .40")
298
+ result.Fp = IndexResult(name="Fp", value=fp_val, ci_lower=fp_lo, ci_upper=fp_hi,
299
+ threshold=0.50, flag=fp_flag)
300
+
301
+ # ----------------------------------------------------------
302
+ # Step 4: L = P(high confidence | incorrect)
303
+ # ----------------------------------------------------------
304
+ l_val = b / n_incorrect if n_incorrect > 0 else 0.0
305
+ l_lo, l_hi = wilson_ci(b, n_incorrect, alpha)
306
+ l_flag = "ok"
307
+ if l_val >= 0.95 and l_lo > 0.90:
308
+ l_flag = "invalid"
309
+ flags.append(f"L = {l_val:.3f} exceeds .95 (Wilson CI lower = {l_lo:.3f} > .90)")
310
+ elif l_val >= 0.95:
311
+ l_flag = "indeterminate"
312
+ flags.append(f"L = {l_val:.3f} at .95 but Wilson CI lower = {l_lo:.3f} spans .90")
313
+ result.L = IndexResult(name="L", value=l_val, ci_lower=l_lo, ci_upper=l_hi,
314
+ threshold=0.95, flag=l_flag)
315
+
316
+     # ----------------------------------------------------------
+     # Step 5: RBS = Fp - (1 - L)
+     # ----------------------------------------------------------
+     rbs_val = fp_val - (1 - l_val)
+     # CI for RBS via component SEs
+     se_fp = np.sqrt(fp_val * (1 - fp_val) / n_correct) if n_correct > 0 else 0.0
+     se_l = np.sqrt(l_val * (1 - l_val) / n_incorrect) if n_incorrect > 0 else 0.0
+     se_rbs = np.sqrt(se_fp**2 + se_l**2)
+     z = sp_stats.norm.ppf(1 - alpha / 2)
+     rbs_lo = rbs_val - z * se_rbs
+     rbs_hi = rbs_val + z * se_rbs
+     rbs_flag = "ok"
+     if rbs_val > 0:
+         if rbs_lo > 0:
+             rbs_flag = "invalid"
+             flags.append(f"RBS = {rbs_val:+.3f}, CI [{rbs_lo:+.3f}, {rbs_hi:+.3f}] excludes zero")
+         else:
+             rbs_flag = "indeterminate"
+             flags.append(f"RBS = {rbs_val:+.3f}, CI [{rbs_lo:+.3f}, {rbs_hi:+.3f}] includes zero")
+     result.rbs = IndexResult(name="RBS", value=rbs_val, ci_lower=rbs_lo, ci_upper=rbs_hi,
+                              threshold=0.0, flag=rbs_flag)
+
+     # ----------------------------------------------------------
+     # Step 6: r(confidence, correct) -- point-biserial
+     # ----------------------------------------------------------
+     conf_int = confidence.astype(int)
+     corr_int = correct.astype(int)
+     r_val, p_val = sp_stats.pointbiserialr(conf_int, corr_int)
+     # Fisher z transform for CI
+     z_r = np.arctanh(r_val) if abs(r_val) < 1.0 else np.sign(r_val) * 3.0
+     se_z = 1.0 / np.sqrt(n - 3) if n > 3 else 1.0
+     z_crit = sp_stats.norm.ppf(1 - alpha / 2)
+     r_lo = np.tanh(z_r - z_crit * se_z)
+     r_hi = np.tanh(z_r + z_crit * se_z)
+     result.r_conf_correct = IndexResult(
+         name="r(confidence, correct)", value=r_val,
+         ci_lower=r_lo, ci_upper=r_hi,
+         threshold=p_val, flag="ok"  # p-value is reported via the threshold field
+     )
+
+     # ----------------------------------------------------------
+     # Classification
+     # ----------------------------------------------------------
+     has_invalid = any(idx_flag == "invalid" for idx_flag in
+                       [result.Fp.flag, result.L.flag, result.rbs.flag])
+     has_indet = any(idx_flag == "indeterminate" for idx_flag in
+                     [result.Fp.flag, result.L.flag, result.rbs.flag])
+
+     if has_invalid:
+         result.tier = "Invalid"
+     elif has_indet:
+         result.tier = "Indeterminate"
+     else:
+         result.tier = "Valid"
+
+     result.flagging_reasons = flags
+
+     # Response style characterisation for non-Valid
+     if result.tier != "Valid":
+         styles = []
+         if result.L and result.L.value >= 0.95:
+             styles.append(f"blanket confidence on errors (L = {result.L.value:.3f})")
+         if result.Fp and result.Fp.value >= 0.50:
+             styles.append(f"excessive withdrawal of correct items (Fp = {result.Fp.value:.3f})")
+         if result.rbs and result.rbs.value > 0:
+             styles.append(f"inverted monitoring (RBS = {result.rbs.value:+.3f})")
+         if result.trin and result.trin.flag == "warning":
+             styles.append(f"near-total response dominance (TRIN = {result.trin.value:.3f})")
+         result.response_style = "; ".join(styles) if styles else "Near-threshold values"
+
+     return result
+
+
+ # ============================================================
+ # Batch screening
+ # ============================================================
+
+ def screen_batch(
+     models: dict,
+     benchmark_name: str = "",
+     **kwargs,
+ ) -> List[ScreenResult]:
+     """Screen multiple models.
+
+     Parameters
+     ----------
+     models : dict
+         Keys are model names, values are dicts with 'correct' and 'confidence' arrays.
+     benchmark_name : str, optional
+         Benchmark name for all models.
+
+     Returns
+     -------
+     list of ScreenResult
+     """
+     results = []
+     for name, data in models.items():
+         r = screen(
+             correct=data["correct"],
+             confidence=data["confidence"],
+             model_name=name,
+             benchmark_name=benchmark_name,
+             **kwargs,
+         )
+         results.append(r)
+     return results
+
+
+ def summary_table(results: List[ScreenResult]) -> str:
+     """Return a formatted summary table of screening results."""
+     header = f"{'Model':<25} {'Tier':<15} {'L':>6} {'Fp':>6} {'RBS':>7} {'TRIN':>6} {'r':>7}"
+     sep = "-" * len(header)
+     lines = [sep, header, sep]
+     for r in sorted(results, key=lambda x: x.tier):
+         l_str = f"{r.L.value:.3f}" if r.L else "—"
+         fp_str = f"{r.Fp.value:.3f}" if r.Fp else "—"
+         rbs_str = f"{r.rbs.value:+.3f}" if r.rbs else "—"
+         trin_str = f"{r.trin.value:.3f}" if r.trin else "—"
+         r_str = f"{r.r_conf_correct.value:+.3f}" if r.r_conf_correct else "—"
+         lines.append(f"{r.model_name:<25} {r.tier:<15} {l_str:>6} {fp_str:>6} {rbs_str:>7} {trin_str:>6} {r_str:>7}")
+     lines.append(sep)
+     return "\n".join(lines)
+
+
+ # ============================================================
+ # Convenience: binarise continuous confidence
+ # ============================================================
+
+ def binarise(confidence: np.ndarray, threshold: float = 0.5, method: str = "fixed") -> np.ndarray:
+     """Binarise continuous confidence to high/low.
+
+     Parameters
+     ----------
+     confidence : array-like
+         Continuous confidence values.
+     threshold : float
+         Threshold value (default 0.5); ignored when method='median'.
+     method : str
+         'fixed' uses the threshold directly.
+         'median' uses the sample median.
+
+     Returns
+     -------
+     np.ndarray of bool
+         True = high confidence, False = low confidence.
+     """
+     confidence = np.asarray(confidence, dtype=float)
+     if method == "median":
+         threshold = np.median(confidence)
+     return confidence >= threshold
@@ -0,0 +1,157 @@
+ Metadata-Version: 2.4
+ Name: validity-screen
+ Version: 0.1.0
+ Summary: Validity screening protocol for LLM confidence signals
+ Author: Jon-Paul Cacioli
+ License: MIT
+ Project-URL: Homepage, https://github.com/synthiumjp/validity-scaling-llm
+ Project-URL: Documentation, https://github.com/synthiumjp/validity-scaling-llm/tree/master/screen
+ Project-URL: Repository, https://github.com/synthiumjp/validity-scaling-llm
+ Project-URL: Issues, https://github.com/synthiumjp/validity-scaling-llm/issues
+ Keywords: llm,confidence,validity,metacognition,calibration,selective-prediction,screening,psychometrics,evaluation
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: numpy>=1.20
+ Requires-Dist: scipy>=1.7
+ Dynamic: license-file
+
+
+ # validity-screen
+
+ **Check whether an LLM's confidence signal carries information before you build on it.**
+
+ [![PyPI](https://img.shields.io/pypi/v/validity-screen)](https://pypi.org/project/validity-screen/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ Implements the screening protocol from:
+
+ > Cacioli, J. P. (2026). *Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals.* arXiv.
+
+ ## Install
+
+ ```bash
+ pip install validity-screen
+ ```
+
+ ## Quick start (Python)
+
+ ```python
+ import numpy as np
+ from validity_screen import screen
+
+ # Your data: item-level correctness and confidence
+ correct = np.array([True, True, False, True, False, True, True, False])
+ confidence = np.array([True, True, True, True, False, True, False, False])
+
+ result = screen(correct, confidence, model_name="My Model")
+
+ print(result.tier)         # 'Valid', 'Indeterminate', or 'Invalid'
+ print(result.vrs_table())  # Complete reporting table
+ ```
+
+ ## Quick start (command line)
+
+ ```bash
+ # From a CSV with 'correct' and 'confidence' columns
+ validity-screen run --data my_data.csv --model-name "GPT-5.4"
+
+ # From separate files
+ validity-screen run --correct correct.txt --confidence confidence.txt
+
+ # Continuous confidence? Binarised at the given threshold
+ validity-screen run --data my_data.csv --confidence-col prob --threshold 0.5
+
+ # JSON output for pipelines
+ validity-screen run --data my_data.csv --json
+ ```
+
+ ## What it does
+
+ Before computing calibration metrics (ECE), metacognitive sensitivity (meta-d', AUROC), or selective prediction accuracy, this protocol checks whether the confidence signal carries item-level information about correctness. If it doesn't, those downstream metrics are fitting noise.
+
+ Five values from a 2x2 contingency table. Three possible outcomes.
+
+ | Tier | Meaning | Action |
+ |------|---------|--------|
+ | **Valid** | Confidence tracks correctness | Proceed with downstream metrics |
+ | **Indeterminate** | Near threshold, uncertain | Compute but flag; consider more items |
+ | **Invalid** | Confidence does not discriminate | Do not interpret AUROC, ECE, selective prediction |
+
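The screen reduces a model's run to a 2x2 table of correctness against binarised confidence. As a rough sketch of how the four cells are counted (plain NumPy, reusing the toy arrays from the quick start; cell names `a`/`b`/`c`/`d` are illustrative, not necessarily the package's internal code):

```python
import numpy as np

# Toy item-level data, as in the quick-start example.
correct = np.array([True, True, False, True, False, True, True, False])
confidence = np.array([True, True, True, True, False, True, False, False])

# 2x2 contingency cells: correctness crossed with high/low confidence.
a = int(np.sum(correct & confidence))    # correct, high confidence
b = int(np.sum(~correct & confidence))   # incorrect, high confidence
c = int(np.sum(correct & ~confidence))   # correct, low confidence
d = int(np.sum(~correct & ~confidence))  # incorrect, low confidence

print([a, b, c, d])  # [4, 1, 1, 2]
```

Every index below is a function of these four counts, so the screen needs nothing beyond item-level correctness and a binary confidence signal.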
+ ## Indices
+
+ | Index | What it detects | Invalid threshold |
+ |-------|-----------------|-------------------|
+ | **L** | Blanket confidence on errors | >= 0.95 |
+ | **Fp** | Over-withdrawal of correct items | >= 0.50 |
+ | **RBS** | Inverted monitoring direction | > 0 (CI excludes zero) |
+ | **TRIN** | Fixed responding | >= 0.95 (warning only) |
+ | **r** | Item-level sensitivity | Reported, not thresholded |
+
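Each index is a one-liner over the 2x2 cells. A point-estimate sketch of the definitions (cell names `a`/`b`/`c`/`d` are illustrative; the package additionally requires a Wilson CI bound to clear the threshold before flagging Invalid):

```python
# Hypothetical 2x2 cells: a = correct/high, b = incorrect/high,
# c = correct/low, d = incorrect/low confidence.
a, b, c, d = 4, 1, 1, 2
n = a + b + c + d

L = b / (b + d)               # P(high confidence | incorrect)
Fp = c / (a + c)              # P(low confidence | correct)
RBS = Fp - (1 - L)            # > 0 means monitoring points the wrong way
TRIN = max(a + b, c + d) / n  # dominance of one response option

# Point-estimate reading of the Invalid thresholds in the table above.
looks_invalid = L >= 0.95 or Fp >= 0.50 or RBS > 0
```

For these counts `L` is 1/3, `Fp` is 0.2, and `RBS` is negative, so nothing trips a threshold; the package's `screen` adds the CI machinery that separates Invalid from Indeterminate.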
+ ## Batch screening
+
+ ```python
+ from validity_screen import screen_batch, summary_table
+
+ models = {
+     "GPT-5.4": {"correct": correct_gpt, "confidence": conf_gpt},
+     "Claude": {"correct": correct_claude, "confidence": conf_claude},
+     "Gemini": {"correct": correct_gemini, "confidence": conf_gemini},
+ }
+
+ results = screen_batch(models, benchmark_name="MMLU")
+ print(summary_table(results))
+ ```
+
+ ## Continuous confidence
+
+ ```python
+ from validity_screen import screen, binarise
+
+ # Binarise at a fixed threshold (here on a 0-100 confidence scale)
+ confidence_binary = binarise(confidence_continuous, threshold=50)
+
+ # Or at the sample median
+ confidence_binary = binarise(confidence_continuous, method='median')
+
+ result = screen(correct, confidence_binary)
+ ```
+
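Under the hood `binarise` is a thresholded comparison; a minimal equivalent of the median mode (illustrative sketch, not the package source):

```python
import numpy as np

conf = np.array([0.20, 0.90, 0.40, 0.70, 0.55])

# method='median': threshold at the sample median; ties count as high.
high = conf >= np.median(conf)
print(high.tolist())  # [False, True, False, True, True]
```

Note that items exactly at the threshold land in the high-confidence bucket, so a median split need not be 50/50.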
+ ## Requirements
+
+ - Python >= 3.8
+ - NumPy >= 1.20
+ - SciPy >= 1.7
+
+ ## Citation
+
+ ```bibtex
+ @article{cacioli2026screen,
+   title={Screen Before You Interpret: A Portable Validity Protocol for
+          Benchmark-Based LLM Confidence Signals},
+   author={Cacioli, Jon-Paul},
+   journal={arXiv preprint},
+   year={2026}
+ }
+
+ @article{cacioli2026validity,
+   title={Before You Interpret the Profile: Validity Scaling for
+          LLM Metacognitive Self-Report},
+   author={Cacioli, Jon-Paul},
+   journal={arXiv preprint},
+   year={2026}
+ }
+ ```
+
+ ## License
+
+ MIT
@@ -0,0 +1,12 @@
+ LICENSE
+ README.md
+ pyproject.toml
+ validity_screen/__init__.py
+ validity_screen/cli.py
+ validity_screen/core.py
+ validity_screen.egg-info/PKG-INFO
+ validity_screen.egg-info/SOURCES.txt
+ validity_screen.egg-info/dependency_links.txt
+ validity_screen.egg-info/entry_points.txt
+ validity_screen.egg-info/requires.txt
+ validity_screen.egg-info/top_level.txt
@@ -0,0 +1,2 @@
+ [console_scripts]
+ validity-screen = validity_screen.cli:main
@@ -0,0 +1,2 @@
+ numpy>=1.20
+ scipy>=1.7
@@ -0,0 +1 @@
+ validity_screen