holdout-evals 0.1.0__tar.gz

@@ -0,0 +1,3 @@
1
+ * text=auto eol=lf
2
+ *.png binary
3
+ *.jpg binary
@@ -0,0 +1,21 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.egg-info/
5
+ .eggs/
6
+ *.egg
7
+ build/
8
+ dist/
9
+ .pytest_cache/
10
+ .mypy_cache/
11
+ .ruff_cache/
12
+
13
+ # envs
14
+ .venv/
15
+ venv/
16
+ env/
17
+
18
+ # os / editors
19
+ .DS_Store
20
+ .idea/
21
+ .vscode/
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Jordan Baillie
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,111 @@
1
+ Metadata-Version: 2.4
2
+ Name: holdout-evals
3
+ Version: 0.1.0
4
+ Summary: An independent significance referee for LLM & agent evals — is your improvement real, or noise?
5
+ Project-URL: Homepage, https://holdout.dev
6
+ Project-URL: Source, https://github.com/jordan-baillie/holdout
7
+ Author: Jordan Baillie
8
+ License: MIT
9
+ License-File: LICENSE
10
+ Keywords: ab-testing,evals,evaluation,llm,mcnemar,overfitting,permutation-test,significance,statistics
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
17
+ Requires-Python: >=3.9
18
+ Requires-Dist: numpy>=1.21
19
+ Provides-Extra: dev
20
+ Requires-Dist: pytest>=7; extra == 'dev'
21
+ Description-Content-Type: text/markdown
22
+
23
+ # holdout
24
+
25
+ **An independent significance referee for LLM & agent evals.** Is your improvement real — or
26
+ noise, multiple-comparisons inflation, or a model that quietly memorized your test set?
27
+
28
+ Most eval "wins" don't survive a paired significance test. `holdout` runs the three checks your
29
+ eval dashboard skips, in your code or in CI:
30
+
31
+ 1. **Is it signal?** A *paired* test (exact McNemar for pass/fail, paired permutation for graded
32
+ scores) with a real confidence interval — not a naked delta.
33
+ 2. **Or did you just try a lot of things?** The bar rises with how many variants you tried. The
34
+ max of 37 noisy attempts is *expected* to look like a win.
35
+ 3. **What would change the verdict?** Power analysis: how many tasks you'd actually need.
36
+
37
+ The stats are open source (this repo). The hosted service ([holdout.dev](https://holdout.dev))
38
+ adds the parts code can't promise: **independence**, a **write-once holdout you can't re-tune
39
+ against**, a contamination scan, and a verifiable badge.
40
+
41
+ ## Install
42
+
43
+ ```bash
44
+ pip install holdout-evals # the import name is still `import holdout`
45
+ ```
46
+
47
+ ## Quickstart — Python
48
+
49
+ ```python
50
+ from holdout import compare
51
+
52
+ # baseline_scores, candidate_scores: per-task scores for the SAME tasks, in the same order (0/1 for pass-fail, or floats)
53
+ res = compare(baseline_scores, candidate_scores, variants_tried=37)
54
+
55
+ print(res.report())
56
+ print(res.significant) # False — gate on this
57
+ print(res.p_value, res.ci) # the honest numbers
58
+ ```
59
+
60
+ ## Quickstart — CLI (drop it in CI)
61
+
62
+ ```bash
63
+ python examples/make_example.py # writes a +4-point "win" that is actually noise
64
+
65
+ holdout check examples/v2.jsonl --baseline examples/v1.jsonl --variants 37
66
+ ```
67
+
68
+ ```
69
+ Holdout - significance check [FAIL]
70
+ baseline 73.0% -> candidate 77.0% (n = 200 tasks)
71
+ effect +4.0 pts 95% CI [-0.5, +8.5]
72
+ test mcnemar_exact p = 0.134
73
+ variants tried 37 -> adjusted p = 1.000 (any-false-win risk 85%)
74
+ paired counts +15 fixed / -7 broke (net +8)
75
+
76
+ VERDICT: WITHIN NOISE - not statistically significant.
77
+ -> Don't ship on this alone; the gain is indistinguishable from sampling noise.
78
+ You'd need ~967 tasks for an effect this size to be detectable.
79
+ ```
80
+
81
+ `holdout check` **exits non-zero** when the improvement isn't a significant gain — so it blocks a
82
+ "ship the noise" merge. As a GitHub Action:
83
+
84
+ ```yaml
85
+ - run: holdout check evals/candidate.jsonl --baseline evals/baseline.jsonl --variants ${{ env.N_VARIANTS }}
86
+ ```
87
+
88
+ Input is JSONL of `{ "task_id": ..., "score": ... }` (also accepts `correct`/`pass`/`reward`;
89
+ booleans and 0/1 become 0.0/1.0). One file per system, joined on `task_id` — or a single
90
+ `--paired` file with `baseline` and `candidate` columns.
91
+
92
+ ## How many tasks do I need?
93
+
94
+ ```bash
95
+ holdout power --baseline-acc 0.75 --effect 0.03 --variants 37
96
+ ```
97
+
98
+ ## Why not just compute it yourself?
99
+
100
+ You can — that's why the math is free. The point of the [hosted service](https://holdout.dev) is
101
+ the four things a local script can't credibly promise: an **independent** verdict (we didn't build
102
+ the agent), a **write-once holdout** scored exactly once per config (no quiet re-tuning), a
103
+ **variants bar that spans your whole team's submissions**, and a **verifiable badge**.
104
+
105
+ ## Reading
106
+
107
+ The methodology follows the published literature on eval rigor — paired tests (Dietterich 1998),
108
+ multiple-comparisons control (Benjamini–Hochberg 1995), benchmark contamination (Zhang et al.
109
+ 2024, *GSM1k*), and power for evals (Miller 2024, *Adding Error Bars to Evals*).
110
+
111
+ MIT licensed. Contributions and corrections welcome — that's the whole point.
@@ -0,0 +1,89 @@
1
+ # holdout
2
+
3
+ **An independent significance referee for LLM & agent evals.** Is your improvement real — or
4
+ noise, multiple-comparisons inflation, or a model that quietly memorized your test set?
5
+
6
+ Most eval "wins" don't survive a paired significance test. `holdout` runs the three checks your
7
+ eval dashboard skips, in your code or in CI:
8
+
9
+ 1. **Is it signal?** A *paired* test (exact McNemar for pass/fail, paired permutation for graded
10
+ scores) with a real confidence interval — not a naked delta.
11
+ 2. **Or did you just try a lot of things?** The bar rises with how many variants you tried. The
12
+ max of 37 noisy attempts is *expected* to look like a win.
13
+ 3. **What would change the verdict?** Power analysis: how many tasks you'd actually need.
14
+
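For graded scores, the paired permutation test named in item 1 can be sketched as a sign-flip test on the per-task differences. This is a minimal illustration using only the standard library, not the package's implementation: the function name `paired_permutation_p` and its parameters (`n_resamples`, `seed`) are hypothetical.

```python
import random


def paired_permutation_p(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation p-value for paired per-task scores.

    Hypothetical helper for illustration; not part of the holdout API.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each task's difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / n >= observed - 1e-12:  # small tolerance for float noise
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)
```

Because the test only permutes signs within each task, it keeps the pairing intact, which is what separates it from comparing two independent accuracy numbers.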
15
+ The stats are open source (this repo). The hosted service ([holdout.dev](https://holdout.dev))
16
+ adds the parts code can't promise: **independence**, a **write-once holdout you can't re-tune
17
+ against**, a contamination scan, and a verifiable badge.
18
+
19
+ ## Install
20
+
21
+ ```bash
22
+ pip install holdout-evals # the import name is still `import holdout`
23
+ ```
24
+
25
+ ## Quickstart — Python
26
+
27
+ ```python
28
+ from holdout import compare
29
+
30
+ # baseline_scores, candidate_scores: per-task scores for the SAME tasks, in the same order (0/1 for pass-fail, or floats)
31
+ res = compare(baseline_scores, candidate_scores, variants_tried=37)
32
+
33
+ print(res.report())
34
+ print(res.significant) # False — gate on this
35
+ print(res.p_value, res.ci) # the honest numbers
36
+ ```
37
+
38
+ ## Quickstart — CLI (drop it in CI)
39
+
40
+ ```bash
41
+ python examples/make_example.py # writes a +4-point "win" that is actually noise
42
+
43
+ holdout check examples/v2.jsonl --baseline examples/v1.jsonl --variants 37
44
+ ```
45
+
46
+ ```
47
+ Holdout - significance check [FAIL]
48
+ baseline 73.0% -> candidate 77.0% (n = 200 tasks)
49
+ effect +4.0 pts 95% CI [-0.5, +8.5]
50
+ test mcnemar_exact p = 0.134
51
+ variants tried 37 -> adjusted p = 1.000 (any-false-win risk 85%)
52
+ paired counts +15 fixed / -7 broke (net +8)
53
+
54
+ VERDICT: WITHIN NOISE - not statistically significant.
55
+ -> Don't ship on this alone; the gain is indistinguishable from sampling noise.
56
+ You'd need ~967 tasks for an effect this size to be detectable.
57
+ ```
58
+
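Several of the numbers in the report above can be checked by hand. The sketch below recomputes the exact McNemar p-value from the 15 vs 7 discordant pairs and the any-false-win risk for 37 variants at a 0.05 threshold. The function names are hypothetical. The Bonferroni step is an assumption: it is one simple correction that is consistent with the displayed adjusted p of 1.000, and the package may use a different adjustment.

```python
from math import comb


def mcnemar_exact_p(fixed: int, broke: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = fixed + broke
    if n == 0:
        return 1.0
    k = min(fixed, broke)
    # Binomial(n, 0.5) tail probability of seeing k or fewer in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def any_false_win_risk(variants: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `variants` independent null tests passes at alpha."""
    return 1 - (1 - alpha) ** variants


def bonferroni_adjust(p: float, variants: int) -> float:
    """Assumed correction for illustration: multiply by the number of variants, cap at 1."""
    return min(1.0, p * variants)


p = mcnemar_exact_p(15, 7)
print(round(p, 3))                        # 0.134
print(round(any_false_win_risk(37), 2))   # 0.85
print(bonferroni_adjust(p, 37))           # 1.0
```

Only the 22 discordant tasks enter the test; the 178 tasks where both systems agree carry no information about which system is better.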
59
+ `holdout check` **exits non-zero** when the improvement isn't a significant gain — so it blocks a
60
+ "ship the noise" merge. As a GitHub Action:
61
+
62
+ ```yaml
63
+ - run: holdout check evals/candidate.jsonl --baseline evals/baseline.jsonl --variants ${{ env.N_VARIANTS }}
64
+ ```
65
+
66
+ Input is JSONL of `{ "task_id": ..., "score": ... }` (also accepts `correct`/`pass`/`reward`;
67
+ booleans and 0/1 become 0.0/1.0). One file per system, joined on `task_id` — or a single
68
+ `--paired` file with `baseline` and `candidate` columns.
69
+
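The two-file input format can be pictured with a short sketch that parses JSONL rows and joins them on `task_id`. This is an illustration under stated assumptions, not the CLI's loader: `parse_scores` and `join_on_task_id` are hypothetical names, the order in which the alternative score fields are checked is a guess, and dropping tasks that appear in only one file is an assumption about behavior the README does not specify.

```python
import json

# Score field names listed in the README; the lookup order here is an assumption.
SCORE_KEYS = ("score", "correct", "pass", "reward")


def parse_scores(lines):
    """Map task_id -> float score from an iterable of JSONL lines."""
    scores = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        row = json.loads(line)
        key = next((k for k in SCORE_KEYS if k in row), None)
        if key is None:
            raise ValueError(f"no score field in row: {row}")
        scores[row["task_id"]] = float(row[key])  # True/False and 0/1 become 1.0/0.0
    return scores


def join_on_task_id(baseline, candidate):
    """Return two aligned score lists covering the task_ids present in both inputs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return [baseline[t] for t in shared], [candidate[t] for t in shared]


base = parse_scores(['{"task_id": "t000", "score": 1}', '{"task_id": "t001", "pass": false}'])
cand = parse_scores(['{"task_id": "t001", "score": 1}', '{"task_id": "t000", "score": 1}'])
print(join_on_task_id(base, cand))  # ([1.0, 0.0], [1.0, 1.0])
```

Joining on `task_id` rather than on line position means the two files may list tasks in different orders and still be paired correctly.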
70
+ ## How many tasks do I need?
71
+
72
+ ```bash
73
+ holdout power --baseline-acc 0.75 --effect 0.03 --variants 37
74
+ ```
75
+
76
+ ## Why not just compute it yourself?
77
+
78
+ You can — that's why the math is free. The point of the [hosted service](https://holdout.dev) is
79
+ the four things a local script can't credibly promise: an **independent** verdict (we didn't build
80
+ the agent), a **write-once holdout** scored exactly once per config (no quiet re-tuning), a
81
+ **variants bar that spans your whole team's submissions**, and a **verifiable badge**.
82
+
83
+ ## Reading
84
+
85
+ The methodology follows the published literature on eval rigor — paired tests (Dietterich 1998),
86
+ multiple-comparisons control (Benjamini–Hochberg 1995), benchmark contamination (Zhang et al.
87
+ 2024, *GSM1k*), and power for evals (Miller 2024, *Adding Error Bars to Evals*).
88
+
89
+ MIT licensed. Contributions and corrections welcome — that's the whole point.
@@ -0,0 +1,37 @@
1
+ """Generate the worked example: a +4-point 'win' on 200 tasks that is actually noise.
2
+
3
+ Reproduces the case from the launch essay. 200 tasks:
4
+ 139 both right, 39 both wrong, 15 baseline-wrong/candidate-right (fixed), 7 the reverse (broke).
5
+ baseline = 146/200 = 73.0%, candidate = 154/200 = 77.0%, net +4.0 pts.
6
+ McNemar on the 15 vs 7 discordant pairs gives p ~ 0.13 — not significant — and after the
7
+ 37 variants the team tried, it isn't close.
8
+
9
+ Run: python examples/make_example.py
10
+ Then: holdout check examples/v2.jsonl --baseline examples/v1.jsonl --variants 37
11
+ """
12
+ import json
13
+ from pathlib import Path
14
+
15
+ HERE = Path(__file__).parent
16
+
17
+ # (baseline, candidate, count)
18
+ GROUPS = [(1, 1, 139), (0, 0, 39), (0, 1, 15), (1, 0, 7)]
19
+
20
+
21
+ def main():
22
+ v1, v2 = [], []
23
+ i = 0
24
+ for base, cand, count in GROUPS:
25
+ for _ in range(count):
26
+ tid = f"t{i:03d}"
27
+ v1.append({"task_id": tid, "score": base})
28
+ v2.append({"task_id": tid, "score": cand})
29
+ i += 1
30
+ (HERE / "v1.jsonl").write_text("\n".join(json.dumps(r) for r in v1) + "\n", encoding="utf-8")
31
+ (HERE / "v2.jsonl").write_text("\n".join(json.dumps(r) for r in v2) + "\n", encoding="utf-8")
32
+ print(f"wrote {i} tasks -> examples/v1.jsonl, examples/v2.jsonl "
33
+ f"(baseline 73.0%, candidate 77.0%, +4.0 pts)")
34
+
35
+
36
+ if __name__ == "__main__":
37
+ main()
@@ -0,0 +1,200 @@
1
+ {"task_id": "t000", "score": 1}
2
+ {"task_id": "t001", "score": 1}
3
+ {"task_id": "t002", "score": 1}
4
+ {"task_id": "t003", "score": 1}
5
+ {"task_id": "t004", "score": 1}
6
+ {"task_id": "t005", "score": 1}
7
+ {"task_id": "t006", "score": 1}
8
+ {"task_id": "t007", "score": 1}
9
+ {"task_id": "t008", "score": 1}
10
+ {"task_id": "t009", "score": 1}
11
+ {"task_id": "t010", "score": 1}
12
+ {"task_id": "t011", "score": 1}
13
+ {"task_id": "t012", "score": 1}
14
+ {"task_id": "t013", "score": 1}
15
+ {"task_id": "t014", "score": 1}
16
+ {"task_id": "t015", "score": 1}
17
+ {"task_id": "t016", "score": 1}
18
+ {"task_id": "t017", "score": 1}
19
+ {"task_id": "t018", "score": 1}
20
+ {"task_id": "t019", "score": 1}
21
+ {"task_id": "t020", "score": 1}
22
+ {"task_id": "t021", "score": 1}
23
+ {"task_id": "t022", "score": 1}
24
+ {"task_id": "t023", "score": 1}
25
+ {"task_id": "t024", "score": 1}
26
+ {"task_id": "t025", "score": 1}
27
+ {"task_id": "t026", "score": 1}
28
+ {"task_id": "t027", "score": 1}
29
+ {"task_id": "t028", "score": 1}
30
+ {"task_id": "t029", "score": 1}
31
+ {"task_id": "t030", "score": 1}
32
+ {"task_id": "t031", "score": 1}
33
+ {"task_id": "t032", "score": 1}
34
+ {"task_id": "t033", "score": 1}
35
+ {"task_id": "t034", "score": 1}
36
+ {"task_id": "t035", "score": 1}
37
+ {"task_id": "t036", "score": 1}
38
+ {"task_id": "t037", "score": 1}
39
+ {"task_id": "t038", "score": 1}
40
+ {"task_id": "t039", "score": 1}
41
+ {"task_id": "t040", "score": 1}
42
+ {"task_id": "t041", "score": 1}
43
+ {"task_id": "t042", "score": 1}
44
+ {"task_id": "t043", "score": 1}
45
+ {"task_id": "t044", "score": 1}
46
+ {"task_id": "t045", "score": 1}
47
+ {"task_id": "t046", "score": 1}
48
+ {"task_id": "t047", "score": 1}
49
+ {"task_id": "t048", "score": 1}
50
+ {"task_id": "t049", "score": 1}
51
+ {"task_id": "t050", "score": 1}
52
+ {"task_id": "t051", "score": 1}
53
+ {"task_id": "t052", "score": 1}
54
+ {"task_id": "t053", "score": 1}
55
+ {"task_id": "t054", "score": 1}
56
+ {"task_id": "t055", "score": 1}
57
+ {"task_id": "t056", "score": 1}
58
+ {"task_id": "t057", "score": 1}
59
+ {"task_id": "t058", "score": 1}
60
+ {"task_id": "t059", "score": 1}
61
+ {"task_id": "t060", "score": 1}
62
+ {"task_id": "t061", "score": 1}
63
+ {"task_id": "t062", "score": 1}
64
+ {"task_id": "t063", "score": 1}
65
+ {"task_id": "t064", "score": 1}
66
+ {"task_id": "t065", "score": 1}
67
+ {"task_id": "t066", "score": 1}
68
+ {"task_id": "t067", "score": 1}
69
+ {"task_id": "t068", "score": 1}
70
+ {"task_id": "t069", "score": 1}
71
+ {"task_id": "t070", "score": 1}
72
+ {"task_id": "t071", "score": 1}
73
+ {"task_id": "t072", "score": 1}
74
+ {"task_id": "t073", "score": 1}
75
+ {"task_id": "t074", "score": 1}
76
+ {"task_id": "t075", "score": 1}
77
+ {"task_id": "t076", "score": 1}
78
+ {"task_id": "t077", "score": 1}
79
+ {"task_id": "t078", "score": 1}
80
+ {"task_id": "t079", "score": 1}
81
+ {"task_id": "t080", "score": 1}
82
+ {"task_id": "t081", "score": 1}
83
+ {"task_id": "t082", "score": 1}
84
+ {"task_id": "t083", "score": 1}
85
+ {"task_id": "t084", "score": 1}
86
+ {"task_id": "t085", "score": 1}
87
+ {"task_id": "t086", "score": 1}
88
+ {"task_id": "t087", "score": 1}
89
+ {"task_id": "t088", "score": 1}
90
+ {"task_id": "t089", "score": 1}
91
+ {"task_id": "t090", "score": 1}
92
+ {"task_id": "t091", "score": 1}
93
+ {"task_id": "t092", "score": 1}
94
+ {"task_id": "t093", "score": 1}
95
+ {"task_id": "t094", "score": 1}
96
+ {"task_id": "t095", "score": 1}
97
+ {"task_id": "t096", "score": 1}
98
+ {"task_id": "t097", "score": 1}
99
+ {"task_id": "t098", "score": 1}
100
+ {"task_id": "t099", "score": 1}
101
+ {"task_id": "t100", "score": 1}
102
+ {"task_id": "t101", "score": 1}
103
+ {"task_id": "t102", "score": 1}
104
+ {"task_id": "t103", "score": 1}
105
+ {"task_id": "t104", "score": 1}
106
+ {"task_id": "t105", "score": 1}
107
+ {"task_id": "t106", "score": 1}
108
+ {"task_id": "t107", "score": 1}
109
+ {"task_id": "t108", "score": 1}
110
+ {"task_id": "t109", "score": 1}
111
+ {"task_id": "t110", "score": 1}
112
+ {"task_id": "t111", "score": 1}
113
+ {"task_id": "t112", "score": 1}
114
+ {"task_id": "t113", "score": 1}
115
+ {"task_id": "t114", "score": 1}
116
+ {"task_id": "t115", "score": 1}
117
+ {"task_id": "t116", "score": 1}
118
+ {"task_id": "t117", "score": 1}
119
+ {"task_id": "t118", "score": 1}
120
+ {"task_id": "t119", "score": 1}
121
+ {"task_id": "t120", "score": 1}
122
+ {"task_id": "t121", "score": 1}
123
+ {"task_id": "t122", "score": 1}
124
+ {"task_id": "t123", "score": 1}
125
+ {"task_id": "t124", "score": 1}
126
+ {"task_id": "t125", "score": 1}
127
+ {"task_id": "t126", "score": 1}
128
+ {"task_id": "t127", "score": 1}
129
+ {"task_id": "t128", "score": 1}
130
+ {"task_id": "t129", "score": 1}
131
+ {"task_id": "t130", "score": 1}
132
+ {"task_id": "t131", "score": 1}
133
+ {"task_id": "t132", "score": 1}
134
+ {"task_id": "t133", "score": 1}
135
+ {"task_id": "t134", "score": 1}
136
+ {"task_id": "t135", "score": 1}
137
+ {"task_id": "t136", "score": 1}
138
+ {"task_id": "t137", "score": 1}
139
+ {"task_id": "t138", "score": 1}
140
+ {"task_id": "t139", "score": 0}
141
+ {"task_id": "t140", "score": 0}
142
+ {"task_id": "t141", "score": 0}
143
+ {"task_id": "t142", "score": 0}
144
+ {"task_id": "t143", "score": 0}
145
+ {"task_id": "t144", "score": 0}
146
+ {"task_id": "t145", "score": 0}
147
+ {"task_id": "t146", "score": 0}
148
+ {"task_id": "t147", "score": 0}
149
+ {"task_id": "t148", "score": 0}
150
+ {"task_id": "t149", "score": 0}
151
+ {"task_id": "t150", "score": 0}
152
+ {"task_id": "t151", "score": 0}
153
+ {"task_id": "t152", "score": 0}
154
+ {"task_id": "t153", "score": 0}
155
+ {"task_id": "t154", "score": 0}
156
+ {"task_id": "t155", "score": 0}
157
+ {"task_id": "t156", "score": 0}
158
+ {"task_id": "t157", "score": 0}
159
+ {"task_id": "t158", "score": 0}
160
+ {"task_id": "t159", "score": 0}
161
+ {"task_id": "t160", "score": 0}
162
+ {"task_id": "t161", "score": 0}
163
+ {"task_id": "t162", "score": 0}
164
+ {"task_id": "t163", "score": 0}
165
+ {"task_id": "t164", "score": 0}
166
+ {"task_id": "t165", "score": 0}
167
+ {"task_id": "t166", "score": 0}
168
+ {"task_id": "t167", "score": 0}
169
+ {"task_id": "t168", "score": 0}
170
+ {"task_id": "t169", "score": 0}
171
+ {"task_id": "t170", "score": 0}
172
+ {"task_id": "t171", "score": 0}
173
+ {"task_id": "t172", "score": 0}
174
+ {"task_id": "t173", "score": 0}
175
+ {"task_id": "t174", "score": 0}
176
+ {"task_id": "t175", "score": 0}
177
+ {"task_id": "t176", "score": 0}
178
+ {"task_id": "t177", "score": 0}
179
+ {"task_id": "t178", "score": 0}
180
+ {"task_id": "t179", "score": 0}
181
+ {"task_id": "t180", "score": 0}
182
+ {"task_id": "t181", "score": 0}
183
+ {"task_id": "t182", "score": 0}
184
+ {"task_id": "t183", "score": 0}
185
+ {"task_id": "t184", "score": 0}
186
+ {"task_id": "t185", "score": 0}
187
+ {"task_id": "t186", "score": 0}
188
+ {"task_id": "t187", "score": 0}
189
+ {"task_id": "t188", "score": 0}
190
+ {"task_id": "t189", "score": 0}
191
+ {"task_id": "t190", "score": 0}
192
+ {"task_id": "t191", "score": 0}
193
+ {"task_id": "t192", "score": 0}
194
+ {"task_id": "t193", "score": 1}
195
+ {"task_id": "t194", "score": 1}
196
+ {"task_id": "t195", "score": 1}
197
+ {"task_id": "t196", "score": 1}
198
+ {"task_id": "t197", "score": 1}
199
+ {"task_id": "t198", "score": 1}
200
+ {"task_id": "t199", "score": 1}