PyPI - pdfhell - Versions diffs - 0.1.0__tar.gz → 0.1.2__tar.gz - Mend

pdfhell 0.1.0.tar.gz → 0.1.2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (31)

{pdfhell-0.1.0 → pdfhell-0.1.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pdfhell
-Version: 0.1.0
+Version: 0.1.2
 Summary: PDF Hell — adversarial PDFs that break AI document readers. Procedural ground truth, not LLM-as-judge.
 Author: Multivon
 License: Apache-2.0
@@ -29,38 +29,68 @@ Dynamic: license-file
 # PDF Hell
-**Adversarial PDFs that break AI document readers — with procedural ground truth, not LLM-as-judge.**
+**Adversarial PDFs that stress-test AI document readers — with procedural ground truth, not LLM-as-judge.**
-PDF Hell is a small, sharp benchmark for the "AI reads PDFs" claim. Every test case is a PDF generated *from code*, so the correct answer is known exactly. There's no LLM judging another LLM's interpretation — the same loop that fooled the model isn't asked to grade it.
+PDF Hell is a small, focused benchmark for three specific failure modes in AI document pipelines. Every test case is a PDF generated *from code*, so the correct answer is known exactly. There's no LLM judging another LLM's interpretation — the same complexity that fools the model isn't asked to grade it.
-If your AI claims it can read documents, it should survive PDFs designed to break it.
+## The headline finding (mini-v1, 30 cases, 2026-05-17)
+GPT-4o falls for the hidden-OCR trap on **10 out of 10 cases (95% Wilson CI [72%, 100%])** — it consistently returns the *invisible* amount from the PDF's text layer instead of the *visible* amount rendered on the page:
+```
+Trap: hidden_ocr_mismatch (invoice — visible total $12,345.67, hidden OCR total $22,345.67)
+Question: What is the TOTAL AMOUNT DUE?
+→ openai:gpt-4o            $22,345.67   ← fell for trap (10/10 in this trap family)
+→ openai:gpt-5.4-mini      $22,345.67   ← fell for trap (9/10)
+→ openai:gpt-5.4           $12,345.67   ← correct (8/10 across trap)
+→ google:gemini-2.5-flash  $12,345.67   ← correct (10/10)
+→ anthropic:claude-sonnet-4-6  $12,345.67   ← correct (10/10)
+```
+The visible page, the hidden text layer, and an agent that fuses both will give three different answers. pdfhell exists to catch that.
 ## Quickstart (30 seconds)
 ```bash
-# 3-case smoke run against the cheapest vision model — works in any env with a Gemini key
+# 3-case smoke run against the cheapest vision model
 export GOOGLE_API_KEY=...
 uvx pdfhell run --model google:gemini-2.5-flash --suite smoke
-# Or run the full mini suite (30 cases, ~10s on Flash, ~$0.01)
+# Or the full mini-v1 suite (30 cases, ~10s on Flash, ~$0.01)
 uvx pdfhell run --model anthropic:claude-sonnet-4-6 --suite mini
-# Or just generate one trap PDF and open it
+# Or generate one trap PDF and inspect it
 uvx pdfhell make --trap hidden_ocr_mismatch --seed 42
 open ./cases/hidden_ocr_mismatch-0042.pdf
 ```
-That's it. `pdfhell run` builds the suite on first use, sends each PDF to the vision model, and grades the answer against code-based ground truth.
+`pdfhell run` builds the suite on first use, sends each PDF to the vision model, and grades the answer against code-based ground truth.
-Smoke result on Gemini 2.5 Flash (one case per family, run this minute):
+## Mini-v1 leaderboard (8 models, 30 cases)
-```
-PDF Hell smoke suite — n=3
-model: google:gemini-2.5-flash
-pass: 3/3  (100.0%)
-```
+| Model | Pass rate | 95% CI | Hidden OCR | Footnote | Split table |
+|---|---:|---:|---:|---:|---:|
+| `anthropic:claude-sonnet-4-6` | 29/30 (97%) | [83%, 99%] | 10/10 | 9/10 | 10/10 |
+| `google:gemini-3.1-pro-preview` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `google:gemini-3.1-flash-lite` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `google:gemini-2.5-pro` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `google:gemini-2.5-flash` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `openai:gpt-5.4` | 27/30 (90%) | [74%, 97%] | 8/10 | 9/10 | 10/10 |
+| `openai:gpt-5.4-mini` | 20/30 (67%) | [49%, 81%] | 1/10 | 9/10 | 10/10 |
+| `openai:gpt-4o` | 14/30 (47%) | [30%, 64%] | **0/10** | 8/10 | 6/10 |
+**What is and isn't supported by this data:**
+- ✅ GPT-4o is materially worse than the others on this suite — its CI [30%, 64%] does not overlap with any other model's.
+- ✅ GPT-4o falls for the hidden-OCR trap 100% of cases (CI [72%, 100%]). Every failure returned the hidden-OCR amount specifically.
+- ✅ GPT-5.4 fixes most of it (80% pass on hidden OCR) — a real generational improvement.
+- ❌ "Claude leads" — Sonnet's CI [83%, 99%] overlaps with Gemini's [78%, 98%]. The two are statistically indistinguishable on this suite. Don't read ordinal rankings from 30 cases.
+- ❌ "PDF Hell is sufficient to evaluate document AI." It's a stress test for three specific failure modes. Pair it with a domain benchmark (DocVQA, your own regression suite) for coverage.
+Suite hash: `8ad87b8d` (mini-v1, 30 cases). Every leaderboard row above was measured on the same hash. Raw run JSON at <https://github.com/multivon-ai/multivon-web/tree/main/public/data/pdfhell-runs>.
-## What's in the mini suite
+## What's in mini-v1
 | Trap family | Cases | What breaks |
 |---|---|---|
@@ -68,9 +98,9 @@ pass: 3/3  (100.0%)
 | `footnote_override` | 10 | Legal clauses where a 6pt footnote overrides the body — liability caps with carve-outs, terminations with restrictions, data-residency with disaster-recovery exceptions. |
 | `split_table_across_pages` | 10 | Financial tables where the header row sits on page 1 and the body rows on page 2. RAG loaders that paginate independently lose column context. |
-Every case has a deterministic seed. Re-running with the same seed regenerates **byte-identical PDFs** and identical answer keys. `Canvas(invariant=True)` is set on every generator so timestamps and document IDs don't drift between runs.
+Every case has a deterministic seed. Re-running with the same seed regenerates **byte-identical PDFs** and identical answer keys (`Canvas(invariant=True)` on every generator).
-The full suite (10 trap families, ~50 cases) is on the [roadmap](#roadmap).
+**Suite versioning.** The `mini-v1` label + suite hash (`8ad87b8d`) fingerprints the exact (trap_family, seed) pairs measured. Adding a new trap family produces `mini-v2` with a different hash — runs across different hashes are not directly comparable. See the next section for the roadmap.
 ## Why this exists

{pdfhell-0.1.0 → pdfhell-0.1.2}/README.md RENAMED

@@ -1,37 +1,67 @@
 # PDF Hell
-**Adversarial PDFs that break AI document readers — with procedural ground truth, not LLM-as-judge.**
+**Adversarial PDFs that stress-test AI document readers — with procedural ground truth, not LLM-as-judge.**
-PDF Hell is a small, sharp benchmark for the "AI reads PDFs" claim. Every test case is a PDF generated *from code*, so the correct answer is known exactly. There's no LLM judging another LLM's interpretation — the same loop that fooled the model isn't asked to grade it.
+PDF Hell is a small, focused benchmark for three specific failure modes in AI document pipelines. Every test case is a PDF generated *from code*, so the correct answer is known exactly. There's no LLM judging another LLM's interpretation — the same complexity that fools the model isn't asked to grade it.
-If your AI claims it can read documents, it should survive PDFs designed to break it.
+## The headline finding (mini-v1, 30 cases, 2026-05-17)
+GPT-4o falls for the hidden-OCR trap on **10 out of 10 cases (95% Wilson CI [72%, 100%])** — it consistently returns the *invisible* amount from the PDF's text layer instead of the *visible* amount rendered on the page:
+```
+Trap: hidden_ocr_mismatch (invoice — visible total $12,345.67, hidden OCR total $22,345.67)
+Question: What is the TOTAL AMOUNT DUE?
+→ openai:gpt-4o            $22,345.67   ← fell for trap (10/10 in this trap family)
+→ openai:gpt-5.4-mini      $22,345.67   ← fell for trap (9/10)
+→ openai:gpt-5.4           $12,345.67   ← correct (8/10 across trap)
+→ google:gemini-2.5-flash  $12,345.67   ← correct (10/10)
+→ anthropic:claude-sonnet-4-6  $12,345.67   ← correct (10/10)
+```
+The visible page, the hidden text layer, and an agent that fuses both will give three different answers. pdfhell exists to catch that.
 ## Quickstart (30 seconds)
 ```bash
-# 3-case smoke run against the cheapest vision model — works in any env with a Gemini key
+# 3-case smoke run against the cheapest vision model
 export GOOGLE_API_KEY=...
 uvx pdfhell run --model google:gemini-2.5-flash --suite smoke
-# Or run the full mini suite (30 cases, ~10s on Flash, ~$0.01)
+# Or the full mini-v1 suite (30 cases, ~10s on Flash, ~$0.01)
 uvx pdfhell run --model anthropic:claude-sonnet-4-6 --suite mini
-# Or just generate one trap PDF and open it
+# Or generate one trap PDF and inspect it
 uvx pdfhell make --trap hidden_ocr_mismatch --seed 42
 open ./cases/hidden_ocr_mismatch-0042.pdf
 ```
-That's it. `pdfhell run` builds the suite on first use, sends each PDF to the vision model, and grades the answer against code-based ground truth.
+`pdfhell run` builds the suite on first use, sends each PDF to the vision model, and grades the answer against code-based ground truth.
-Smoke result on Gemini 2.5 Flash (one case per family, run this minute):
+## Mini-v1 leaderboard (8 models, 30 cases)
-```
-PDF Hell smoke suite — n=3
-model: google:gemini-2.5-flash
-pass: 3/3  (100.0%)
-```
+| Model | Pass rate | 95% CI | Hidden OCR | Footnote | Split table |
+|---|---:|---:|---:|---:|---:|
+| `anthropic:claude-sonnet-4-6` | 29/30 (97%) | [83%, 99%] | 10/10 | 9/10 | 10/10 |
+| `google:gemini-3.1-pro-preview` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `google:gemini-3.1-flash-lite` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `google:gemini-2.5-pro` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `google:gemini-2.5-flash` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `openai:gpt-5.4` | 27/30 (90%) | [74%, 97%] | 8/10 | 9/10 | 10/10 |
+| `openai:gpt-5.4-mini` | 20/30 (67%) | [49%, 81%] | 1/10 | 9/10 | 10/10 |
+| `openai:gpt-4o` | 14/30 (47%) | [30%, 64%] | **0/10** | 8/10 | 6/10 |
+**What is and isn't supported by this data:**
+- ✅ GPT-4o is materially worse than the others on this suite — its CI [30%, 64%] does not overlap with any other model's.
+- ✅ GPT-4o falls for the hidden-OCR trap 100% of cases (CI [72%, 100%]). Every failure returned the hidden-OCR amount specifically.
+- ✅ GPT-5.4 fixes most of it (80% pass on hidden OCR) — a real generational improvement.
+- ❌ "Claude leads" — Sonnet's CI [83%, 99%] overlaps with Gemini's [78%, 98%]. The two are statistically indistinguishable on this suite. Don't read ordinal rankings from 30 cases.
+- ❌ "PDF Hell is sufficient to evaluate document AI." It's a stress test for three specific failure modes. Pair it with a domain benchmark (DocVQA, your own regression suite) for coverage.
+Suite hash: `8ad87b8d` (mini-v1, 30 cases). Every leaderboard row above was measured on the same hash. Raw run JSON at <https://github.com/multivon-ai/multivon-web/tree/main/public/data/pdfhell-runs>.
-## What's in the mini suite
+## What's in mini-v1
 | Trap family | Cases | What breaks |
 |---|---|---|
@@ -39,9 +69,9 @@ pass: 3/3  (100.0%)
 | `footnote_override` | 10 | Legal clauses where a 6pt footnote overrides the body — liability caps with carve-outs, terminations with restrictions, data-residency with disaster-recovery exceptions. |
 | `split_table_across_pages` | 10 | Financial tables where the header row sits on page 1 and the body rows on page 2. RAG loaders that paginate independently lose column context. |
-Every case has a deterministic seed. Re-running with the same seed regenerates **byte-identical PDFs** and identical answer keys. `Canvas(invariant=True)` is set on every generator so timestamps and document IDs don't drift between runs.
+Every case has a deterministic seed. Re-running with the same seed regenerates **byte-identical PDFs** and identical answer keys (`Canvas(invariant=True)` on every generator).
-The full suite (10 trap families, ~50 cases) is on the [roadmap](#roadmap).
+**Suite versioning.** The `mini-v1` label + suite hash (`8ad87b8d`) fingerprints the exact (trap_family, seed) pairs measured. Adding a new trap family produces `mini-v2` with a different hash — runs across different hashes are not directly comparable. See the next section for the roadmap.
 ## Why this exists
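
The README's "supported / not supported" bullets turn on whether two confidence intervals overlap. A minimal sketch of that check, using the 95% CI column exactly as printed in the mini-v1 leaderboard table; the names `LEADERBOARD_CI`, `overlaps`, and `overlapping_models` are illustrative and are not part of pdfhell.

```python
# Illustrative only: none of these names are pdfhell APIs.
# Intervals are copied from the "95% CI" column of the mini-v1 table (percent).
LEADERBOARD_CI = {
    "anthropic:claude-sonnet-4-6": (83, 99),
    "google:gemini-3.1-pro-preview": (78, 98),
    "google:gemini-3.1-flash-lite": (78, 98),
    "google:gemini-2.5-pro": (78, 98),
    "google:gemini-2.5-flash": (78, 98),
    "openai:gpt-5.4": (74, 97),
    "openai:gpt-5.4-mini": (49, 81),
    "openai:gpt-4o": (30, 64),
}


def overlaps(a, b):
    """True when the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlapping_models(target):
    """Every other model whose interval overlaps the target's interval."""
    target_ci = LEADERBOARD_CI[target]
    return sorted(
        model
        for model, ci in LEADERBOARD_CI.items()
        if model != target and overlaps(target_ci, ci)
    )


if __name__ == "__main__":
    print(overlapping_models("openai:gpt-4o"))
```

On the intervals as printed, GPT-4o's [30%, 64%] is disjoint from the six highest rows but does overlap GPT-5.4-mini's [49%, 81%], so the "does not overlap with any other model's" bullet holds only for those six rows. The same helper confirms the Sonnet and Gemini intervals overlap, which is the basis for the "statistically indistinguishable" bullet.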

{pdfhell-0.1.0 → pdfhell-0.1.2}/pdfhell/__init__.py RENAMED

@@ -16,7 +16,7 @@ layer; the runtime, scoring, and reporting come from multivon-eval.
 """
 from __future__ import annotations
-__version__ = "0.1.0"
+__version__ = "0.1.2"
 from .case import HellCase
 from .generators import (

{pdfhell-0.1.0 → pdfhell-0.1.2}/pdfhell/auditpack.py RENAMED

@@ -39,14 +39,24 @@ from .scorer import SuiteReport
 _README_TEMPLATE = """\
 # pdfhell audit pack
-This ZIP is a complete, self-describing record of one PDF Hell run. It
+This ZIP is a complete, self-verifying record of one PDF Hell run. It
 contains every PDF the model was asked to read, every answer key, the
-raw model output, and a tamper-evident manifest.
+raw model output, and a SHA-256 manifest you can recompute from the
+ZIP contents to check that nothing has been edited since delivery.
+## A note on threat model
+This is a self-verifying pack, not a tamper-PROOF one: an adversary
+with access to the ZIP can edit any file AND re-write the manifest
+hashes to match. To detect tampering by an external party, pin the
+manifest's SHA-256 in your procurement record (out-of-band) and
+verify on receipt. For full tamper-proof attestation, sign the
+manifest with an external GPG / Sigstore key (out of scope for v1).
 ## What's in this pack
 - manifest.json — Run metadata + SHA-256 of every file in this ZIP.
-- run.json — Full run report (per-case scores, model outputs).
+- run.json — Full run report (per-case scores, model outputs, CIs).
 - run.xml — JUnit XML (renders in CI dashboards).
 - cases/*.pdf — The adversarial PDFs the model was tested against.
 - cases/*.json — The answer keys + per-case metadata.
@@ -54,9 +64,6 @@ raw model output, and a tamper-evident manifest.
 ## How to verify
-The manifest contains a SHA-256 for every file in this ZIP. To verify
-nothing was edited after delivery:
   unzip -p audit-pack.zip manifest.json | jq .files
   sha256sum cases/*.pdf cases/*.json run.json run.xml README.txt
@@ -70,13 +77,17 @@ byte-identical PDFs and re-run the same model:
   {repro_command}
 pdfhell uses Canvas(invariant=True) on every generator so PDFs are
-byte-identical across runs with the same seed.
+byte-identical across runs with the same seed. The manifest's
+`suite_hash` fingerprints the exact (trap_family, seed) pairs that
+were measured — re-runs with a different hash measured different
+cases and are not directly comparable.
 ## Scope
 pdfhell {pdfhell_version}, suite {suite}, model {model}. Generated
-{timestamp}. {n} cases, {passed}/{n} passed ({pass_rate:.0%}). See
-manifest.json for per-trap breakdown.
+{timestamp}. {n} cases, {passed}/{n} passed ({pass_rate:.0%}, 95%
+Wilson CI shown in manifest.json). See manifest.json for per-trap
+breakdown and per-trap CIs.
 """
@@ -149,16 +160,22 @@ def build_audit_pack(
         "generated_at": timestamp,
         "model": report.model,
         "suite": report.suite,
+        "suite_version": report.suite_version,
+        "suite_hash": report.suite_hash,
         "n": report.n,
         "passed": passed,
         "pass_rate": report.pass_rate,
+        "pass_rate_ci_95": list(report.pass_rate_ci),
         "per_trap_pass": report.per_trap_pass,
+        "per_trap_pass_ci_95": {k: list(v) for k, v in report.per_trap_pass_ci.items()},
         "per_trap_fell_for_trap": report.per_trap_fell_for_trap,
         "reproduction": {
             "command": repro_command,
             "note": (
                 "PDFs are regenerated byte-identically via Canvas(invariant=True). "
-                "Same seed → same PDF → same answer key."
+                "Same seed → same PDF → same answer key. The suite_hash above "
+                "fingerprints the exact (trap, seed) pairs measured — auditors "
+                "should refuse any run with a mismatched hash."
             ),
         },
         "files": [

{pdfhell-0.1.0 → pdfhell-0.1.2}/pdfhell/cli.py RENAMED

@@ -34,6 +34,40 @@ def _cmd_list_traps(args: argparse.Namespace) -> int:
     return 0
+def _cmd_discover(args: argparse.Namespace) -> int:
+    """Print the machine-readable pdfhell capability catalog as JSON.
+    Same shape an agent gets via the multivon-mcp ``eval_discover`` tool —
+    surfaced as a CLI command so agents that don't speak MCP can pipe
+    ``pdfhell discover --json | jq ...`` to plan a run.
+    """
+    from .generators import GENERATORS
+    catalog = {
+        "package": "pdfhell",
+        "version": __version__,
+        "traps": [],
+        "suites": [],
+    }
+    for trap in TRAP_FAMILIES:
+        _, example_case = GENERATORS[trap](seed=1)
+        catalog["traps"].append({
+            "name": trap,
+            "example_question": example_case.question,
+            "example_expected_answer": example_case.expected_answer,
+        })
+    for name, spec in SUITES.items():
+        catalog["suites"].append({
+            "name": name,
+            "version": spec.version,
+            "suite_hash": spec.suite_hash,
+            "total_cases": spec.total_cases,
+            "trap_seeds": {trap: list(seeds) for trap, seeds in spec.traps.items()},
+        })
+    json.dump(catalog, sys.stdout, indent=2 if not args.compact else None)
+    print()
+    return 0
 def _cmd_make(args: argparse.Namespace) -> int:
     try:
         pdf_bytes, case = generate_case(args.trap, args.seed)
@@ -158,6 +192,13 @@ def build_parser() -> argparse.ArgumentParser:
     p_list = sub.add_parser("list-traps", help="list available trap families")
     p_list.set_defaults(func=_cmd_list_traps)
+    p_discover = sub.add_parser(
+        "discover",
+        help="emit pdfhell capability catalog as JSON (for agents that don't speak MCP)",
+    )
+    p_discover.add_argument("--compact", action="store_true", help="single-line JSON, no indent")
+    p_discover.set_defaults(func=_cmd_discover)
     p_make = sub.add_parser("make", help="generate one case (pdf + json)")
     p_make.add_argument("--trap", required=True, choices=TRAP_FAMILIES)
     p_make.add_argument("--seed", required=True, type=int)
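
A consumer of the new `discover` output might sanity-check the catalog before planning a run. The sketch below relies only on the keys that `_cmd_discover` builds in the hunk above (`suites`, `name`, `total_cases`, `trap_seeds`); `check_catalog` is a hypothetical helper and the sample values, including the placeholder hash, are made up for illustration.

```python
def check_catalog(catalog: dict) -> dict[str, bool]:
    """Map suite name -> whether total_cases equals the number of listed seeds."""
    return {
        suite["name"]: suite["total_cases"]
        == sum(len(seeds) for seeds in suite["trap_seeds"].values())
        for suite in catalog["suites"]
    }


# Hypothetical sample in the shape emitted by `pdfhell discover`.
# Suite name, trap names, seeds, and hash are placeholders, not real values.
SAMPLE_CATALOG = {
    "package": "pdfhell",
    "version": "0.1.2",
    "traps": [],
    "suites": [
        {
            "name": "demo",
            "version": "demo-v1",
            "suite_hash": "00000000",
            "total_cases": 3,
            "trap_seeds": {"trap_a": [1, 2], "trap_b": [3]},
        }
    ],
}

if __name__ == "__main__":
    print(check_catalog(SAMPLE_CATALOG))
```

In practice the dictionary would come from parsing the command's standard output with `json.load` rather than from a literal.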

{pdfhell-0.1.0 → pdfhell-0.1.2}/pdfhell/runner.py RENAMED

@@ -23,6 +23,7 @@ from multivon_eval import JudgeConfig
 from .case import HellCase
 from .scorer import CaseScore, SuiteReport, score_case, summarise
+from .suite import SUITES
 from .vision import call_vision
@@ -98,6 +99,10 @@ def run_suite(
     ``cases_dir`` must contain ``<case_id>.json`` and ``<case_id>.pdf``
     pairs produced by :func:`pdfhell.suite.build_suite`.
+    The returned :class:`SuiteReport` is annotated with the canonical
+    ``suite_version`` and ``suite_hash`` from the named suite, so
+    consumers can verify the run measured the expected cases.
     """
     judge = parse_model_spec(model_spec)
     jobs = list(_load_jobs(cases_dir))
@@ -123,7 +128,12 @@ def run_suite(
             if progress:
                 mark = "✓" if score.correct else ("⚠" if score.fell_for_trap else "✗")
                 print(f"  {mark} {score.case_id:36s}  expected={score.expected!r:30s}  got={answer[:60]!r}")
-    return summarise(model_spec, suite_name, scores)
+    report = summarise(model_spec, suite_name, scores)
+    spec = SUITES.get(suite_name)
+    if spec is not None:
+        report.suite_version = spec.version
+        report.suite_hash = spec.suite_hash
+    return report
 def _load_jobs(cases_dir: Path) -> Iterable[_Job]:

{pdfhell-0.1.0 → pdfhell-0.1.2}/pdfhell/scorer.py RENAMED

@@ -9,9 +9,17 @@ QAG (multivon-eval's :class:`~multivon_eval.DocumentGrounding`) is
 available separately as the *explanation* of why a model failed — "the
 model returned $19,900.25, matching the hidden-OCR layer rather than
 the visible $18,900.25" — but it never affects pass/fail.
+Every reported pass rate is paired with a 95% Wilson confidence
+interval. A 10-case trap-family run at 100% pass has Wilson 95% CI
+[0.72, 1.00] — meaning the *true* per-trap pass rate could plausibly
+be as low as 72%. Differences of <~10pp at n=30 are not statistically
+distinguishable. We surface the CI everywhere we surface the rate so
+nobody draws ordinal conclusions from indistinguishable runs.
 """
 from __future__ import annotations
+import math
 import re
 from dataclasses import dataclass, field
 from typing import Any
@@ -19,6 +27,33 @@ from typing import Any
 from .case import HellCase
+# ─── Statistical-rigor utility ─────────────────────────────────────────────
+def wilson_ci(passes: int, n: int, *, z: float = 1.959963984540054) -> tuple[float, float]:
+    """Return the (lower, upper) Wilson 95% confidence interval for a
+    binomial proportion of ``passes`` successes out of ``n`` trials.
+    Defaults to z = 1.96 (95% CI). Pass z=2.576 for 99% CI. Returns
+    (0.0, 1.0) when ``n == 0`` — vacuous CI for an empty run.
+    Why Wilson over the Wald / normal-approximation interval? At our
+    sample sizes (n=10 per trap, n=30 per suite) the Wald interval is
+    *wrong* near 0 and 1 (it can return negative lower bounds or
+    upper bounds > 1, both nonsensical for a probability). Wilson is
+    well-behaved across the entire [0, 1] domain and is the standard
+    interval for small-sample proportion estimates.
+    """
+    if n <= 0:
+        return (0.0, 1.0)
+    p = passes / n
+    denom = 1.0 + (z * z) / n
+    centre = (p + (z * z) / (2.0 * n)) / denom
+    half = (z / denom) * math.sqrt((p * (1.0 - p) + (z * z) / (4.0 * n)) / n)
+    lo = max(0.0, centre - half)
+    hi = min(1.0, centre + half)
+    return (lo, hi)
 _WHITESPACE_RE = re.compile(r"\s+")
 _PUNCT_NORMALIZE_RE = re.compile(r"[.,;:]+\s*$")
@@ -165,14 +200,47 @@ class SuiteReport:
     per_trap_fell_for_trap: dict[str, float]
     refused_rate: float
     cases: list[CaseScore] = field(default_factory=list)
+    suite_version: str = ""  # e.g. "mini-v1" — see pdfhell.suite.SuiteSpec.version
+    suite_hash: str = ""  # 8-char SHA-256 prefix of the sorted (trap, seed) pairs
+    # ─── Confidence intervals ──────────────────────────────────────────────
+    @property
+    def pass_rate_ci(self) -> tuple[float, float]:
+        """95% Wilson confidence interval on the overall pass rate."""
+        return wilson_ci(int(round(self.pass_rate * self.n)), self.n)
+    @property
+    def per_trap_pass_ci(self) -> dict[str, tuple[float, float]]:
+        """Per-trap-family Wilson 95% CIs.
+        Uses the actual case counts (typically 10 per family in the mini
+        suite). Surfaced on the leaderboard so 100% pass on n=10 isn't
+        confused with "the model never fails."
+        """
+        if not self.cases:
+            return {}
+        # Count cases per family rather than guessing.
+        by_family: dict[str, list[CaseScore]] = {}
+        for c in self.cases:
+            by_family.setdefault(c.trap_family, []).append(c)
+        out: dict[str, tuple[float, float]] = {}
+        for family, scores in by_family.items():
+            passes = sum(1 for s in scores if s.correct)
+            out[family] = wilson_ci(passes, len(scores))
+        return out
     def to_dict(self) -> dict[str, Any]:
         return {
             "model": self.model,
             "suite": self.suite,
+            "suite_version": self.suite_version,
+            "suite_hash": self.suite_hash,
             "n": self.n,
             "pass_rate": self.pass_rate,
+            "pass_rate_ci": list(self.pass_rate_ci),
             "per_trap_pass": self.per_trap_pass,
+            "per_trap_pass_ci": {k: list(v) for k, v in self.per_trap_pass_ci.items()},
             "per_trap_fell_for_trap": self.per_trap_fell_for_trap,
             "refused_rate": self.refused_rate,
             "cases": [c.to_dict() for c in self.cases],
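
As a worked example, the intervals quoted in the README can be reproduced from the formula in the `wilson_ci` hunk above. The block below is a standalone copy of that formula so it runs without pdfhell installed; `wald_ci` and `as_percent` are illustrative helpers added here for comparison and are not in the package.

```python
import math

Z_95 = 1.959963984540054  # same default z as the wilson_ci added in this diff


def wilson_ci(passes: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for `passes` successes out of `n` trials."""
    if n <= 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1.0 + (z * z) / n
    centre = (p + (z * z) / (2.0 * n)) / denom
    half = (z / denom) * math.sqrt((p * (1.0 - p) + (z * z) / (4.0 * n)) / n)
    return (max(0.0, centre - half), min(1.0, centre + half))


def wald_ci(passes: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Normal-approximation interval, shown only to illustrate its failure modes."""
    p = passes / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return (p - half, p + half)


def as_percent(ci: tuple[float, float]) -> tuple[int, int]:
    """Round an interval to whole percentages, as the leaderboard does."""
    return (round(100 * ci[0]), round(100 * ci[1]))


if __name__ == "__main__":
    print(as_percent(wilson_ci(10, 10)))  # per-trap, 10/10
    print(as_percent(wilson_ci(29, 30)))  # suite-level, 29/30
    print(as_percent(wilson_ci(14, 30)))  # suite-level, 14/30
    print(wald_ci(10, 10), wald_ci(1, 10))
```

The three Wilson calls give [72%, 100%], [83%, 99%], and [30%, 64%], matching the README rows. The Wald interval collapses to zero width at 10/10 and goes negative at 1/10, which is the small-sample behaviour the docstring cites as the reason for choosing Wilson.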

{pdfhell-0.1.0 → pdfhell-0.1.2}/pdfhell/suite.py RENAMED

@@ -7,9 +7,20 @@ answer keys.
 This is part of the "code-based ground truth" promise: the suite isn't
 a static blob, it's a recipe + a verifiable hash.
+# Versioning
+Suites are versioned (e.g. ``mini-v1``) so adding a new trap family
+doesn't silently invalidate published leaderboard numbers. Each suite
+also carries a :attr:`SuiteSpec.suite_hash` — an 8-char SHA-256 prefix
+of the sorted ``(trap_family, seed)`` pairs. Two runs with the same
+``suite_hash`` measured the *exact* same cases; runs with different
+hashes are not directly comparable. The hash is included in every
+``SuiteReport`` and the audit pack ``manifest.json``.
 """
 from __future__ import annotations
+import hashlib
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Iterable
@@ -25,25 +36,52 @@ class SuiteSpec:
     ``traps`` maps a trap family name to a list of seeds — those exact
     seeds produce those exact PDFs. Run ``pdfhell build-suite --suite
     mini`` to materialise to disk.
+    ``version`` is the human-readable label that gets published in
+    leaderboard rows (e.g. ``mini-v1``). Bump the version (and the name)
+    when adding trap families so historical comparisons stay valid.
     """
     name: str
     traps: dict[str, list[int]] = field(default_factory=dict)
+    version: str = ""
     @property
     def total_cases(self) -> int:
         return sum(len(s) for s in self.traps.values())
+    @property
+    def suite_hash(self) -> str:
+        """8-char SHA-256 prefix of the sorted ``(trap, seed)`` pairs.
+        Two suites with the same ``suite_hash`` evaluated the EXACT same
+        cases; runs across different hashes are not directly comparable.
+        Surfaced in every SuiteReport + the audit-pack manifest.
+        """
+        items = sorted(
+            (trap, seed)
+            for trap, seeds in self.traps.items()
+            for seed in seeds
+        )
+        payload = "\n".join(f"{trap}\t{seed}" for trap, seed in items).encode("utf-8")
+        return hashlib.sha256(payload).hexdigest()[:8]
 def mini_suite() -> SuiteSpec:
-    """The canonical ``mini`` suite: 30 cases, 10 per trap family.
+    """The canonical ``mini-v1`` suite: 30 cases, 10 per trap family.
     Seeds are arbitrary but fixed. The published leaderboard at
     ``multivon.ai/leaderboard`` runs this exact spec — re-running it on
     any machine produces identical PDFs.
+    Versioning: adding a new trap family to the mini suite produces a
+    new spec (``mini-v2``, etc.). Older leaderboard rows tagged
+    ``mini-v1`` remain directly comparable across machines, dates, and
+    judge versions; rows tagged different versions are not.
     """
     return SuiteSpec(
         name="mini",
+        version="mini-v1",
         traps={
             "hidden_ocr_mismatch":      list(range(1001, 1011)),
             "footnote_override":        list(range(2001, 2011)),
@@ -64,6 +102,7 @@ def smoke_suite() -> SuiteSpec:
     """
     return SuiteSpec(
         name="smoke",
+        version="smoke-v1",
         traps={
             "hidden_ocr_mismatch":      [1001],
             "footnote_override":        [2001],
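
To make the versioning rule concrete, here is a standalone sketch of the `suite_hash` computation from the `SuiteSpec` hunk above, written as a free function so it can be run without the package. The trap names and seeds are placeholders chosen for illustration; they are not the real mini-v1 seeds, so the sketch does not attempt to reproduce `8ad87b8d`.

```python
import hashlib


def suite_hash(traps: dict[str, list[int]]) -> str:
    """8-char SHA-256 prefix of the sorted (trap, seed) pairs."""
    items = sorted(
        (trap, seed)
        for trap, seeds in traps.items()
        for seed in seeds
    )
    payload = "\n".join(f"{trap}\t{seed}" for trap, seed in items).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]


# Placeholder suites: names and seeds are hypothetical.
V1 = {"trap_a": [1, 2, 3], "trap_b": [10, 11]}
V1_REORDERED = {"trap_b": [11, 10], "trap_a": [3, 1, 2]}
V2 = {**V1, "trap_c": [20]}

if __name__ == "__main__":
    print(suite_hash(V1) == suite_hash(V1_REORDERED))  # ordering does not matter
    print(suite_hash(V1) == suite_hash(V2))            # a new family changes the hash
```

Because the pairs are sorted before hashing, declaration order does not affect the fingerprint, while adding a family or a seed does. An 8-character prefix is 32 bits, so it serves as a convenient label for a known set of suites rather than a collision-resistant identifier.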

{pdfhell-0.1.0 → pdfhell-0.1.2}/pdfhell.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pdfhell
-Version: 0.1.0
+Version: 0.1.2
 Summary: PDF Hell — adversarial PDFs that break AI document readers. Procedural ground truth, not LLM-as-judge.
 Author: Multivon
 License: Apache-2.0
@@ -29,38 +29,68 @@ Dynamic: license-file
 # PDF Hell
-**Adversarial PDFs that break AI document readers — with procedural ground truth, not LLM-as-judge.**
+**Adversarial PDFs that stress-test AI document readers — with procedural ground truth, not LLM-as-judge.**
-PDF Hell is a small, sharp benchmark for the "AI reads PDFs" claim. Every test case is a PDF generated *from code*, so the correct answer is known exactly. There's no LLM judging another LLM's interpretation — the same loop that fooled the model isn't asked to grade it.
+PDF Hell is a small, focused benchmark for three specific failure modes in AI document pipelines. Every test case is a PDF generated *from code*, so the correct answer is known exactly. There's no LLM judging another LLM's interpretation — the same complexity that fools the model isn't asked to grade it.
-If your AI claims it can read documents, it should survive PDFs designed to break it.
+## The headline finding (mini-v1, 30 cases, 2026-05-17)
+GPT-4o falls for the hidden-OCR trap on **10 out of 10 cases (95% Wilson CI [72%, 100%])** — it consistently returns the *invisible* amount from the PDF's text layer instead of the *visible* amount rendered on the page:
+```
+Trap: hidden_ocr_mismatch (invoice — visible total $12,345.67, hidden OCR total $22,345.67)
+Question: What is the TOTAL AMOUNT DUE?
+→ openai:gpt-4o            $22,345.67   ← fell for trap (10/10 in this trap family)
+→ openai:gpt-5.4-mini      $22,345.67   ← fell for trap (9/10)
+→ openai:gpt-5.4           $12,345.67   ← correct (8/10 across trap)
+→ google:gemini-2.5-flash  $12,345.67   ← correct (10/10)
+→ anthropic:claude-sonnet-4-6  $12,345.67   ← correct (10/10)
+```
+The visible page, the hidden text layer, and an agent that fuses both will give three different answers. pdfhell exists to catch that.
 ## Quickstart (30 seconds)
 ```bash
-# 3-case smoke run against the cheapest vision model — works in any env with a Gemini key
+# 3-case smoke run against the cheapest vision model
 export GOOGLE_API_KEY=...
 uvx pdfhell run --model google:gemini-2.5-flash --suite smoke
-# Or run the full mini suite (30 cases, ~10s on Flash, ~$0.01)
+# Or the full mini-v1 suite (30 cases, ~10s on Flash, ~$0.01)
 uvx pdfhell run --model anthropic:claude-sonnet-4-6 --suite mini
-# Or just generate one trap PDF and open it
+# Or generate one trap PDF and inspect it
 uvx pdfhell make --trap hidden_ocr_mismatch --seed 42
 open ./cases/hidden_ocr_mismatch-0042.pdf
 ```
-That's it. `pdfhell run` builds the suite on first use, sends each PDF to the vision model, and grades the answer against code-based ground truth.
+`pdfhell run` builds the suite on first use, sends each PDF to the vision model, and grades the answer against code-based ground truth.
-Smoke result on Gemini 2.5 Flash (one case per family, run this minute):
+## Mini-v1 leaderboard (8 models, 30 cases)
-```
-PDF Hell smoke suite — n=3
-model: google:gemini-2.5-flash
-pass: 3/3  (100.0%)
-```
+| Model | Pass rate | 95% CI | Hidden OCR | Footnote | Split table |
+|---|---:|---:|---:|---:|---:|
+| `anthropic:claude-sonnet-4-6` | 29/30 (97%) | [83%, 99%] | 10/10 | 9/10 | 10/10 |
+| `google:gemini-3.1-pro-preview` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `google:gemini-3.1-flash-lite` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `google:gemini-2.5-pro` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `google:gemini-2.5-flash` | 28/30 (93%) | [78%, 98%] | 10/10 | 8/10 | 10/10 |
+| `openai:gpt-5.4` | 27/30 (90%) | [74%, 97%] | 8/10 | 9/10 | 10/10 |
+| `openai:gpt-5.4-mini` | 20/30 (67%) | [49%, 81%] | 1/10 | 9/10 | 10/10 |
+| `openai:gpt-4o` | 14/30 (47%) | [30%, 64%] | **0/10** | 8/10 | 6/10 |
+**What is and isn't supported by this data:**
+- ✅ GPT-4o is materially worse than the others on this suite — its CI [30%, 64%] does not overlap with any other model's.
+- ✅ GPT-4o falls for the hidden-OCR trap in 100% of cases (10/10, CI [72%, 100%]). Every failure returned the hidden-OCR amount specifically.
+- ✅ GPT-5.4 fixes most of it (80% pass on hidden OCR) — a real generational improvement.
+- ❌ "Claude leads" — Sonnet's CI [83%, 99%] overlaps with Gemini's [78%, 98%]. The two are statistically indistinguishable on this suite. Don't read ordinal rankings from 30 cases.
+- ❌ "PDF Hell is sufficient to evaluate document AI." It's a stress test for three specific failure modes. Pair it with a domain benchmark (DocVQA, your own regression suite) for coverage.
+Suite hash: `8ad87b8d` (mini-v1, 30 cases). Every leaderboard row above was measured on the same hash. Raw run JSON at <https://github.com/multivon-ai/multivon-web/tree/main/public/data/pdfhell-runs>.
-## What's in the mini suite
+## What's in mini-v1
 | Trap family | Cases | What breaks |
 |---|---|---|
@@ -68,9 +98,9 @@ pass: 3/3  (100.0%)
 | `footnote_override` | 10 | Legal clauses where a 6pt footnote overrides the body — liability caps with carve-outs, terminations with restrictions, data-residency with disaster-recovery exceptions. |
 | `split_table_across_pages` | 10 | Financial tables where the header row sits on page 1 and the body rows on page 2. RAG loaders that paginate independently lose column context. |
-Every case has a deterministic seed. Re-running with the same seed regenerates **byte-identical PDFs** and identical answer keys. `Canvas(invariant=True)` is set on every generator so timestamps and document IDs don't drift between runs.
+Every case has a deterministic seed. Re-running with the same seed regenerates **byte-identical PDFs** and identical answer keys (`Canvas(invariant=True)` on every generator).
-The full suite (10 trap families, ~50 cases) is on the [roadmap](#roadmap).
+**Suite versioning.** The `mini-v1` label and the suite hash (`8ad87b8d`) together fingerprint the exact (trap_family, seed) pairs measured. Adding a new trap family produces `mini-v2` with a different hash, so runs with different hashes are not directly comparable. See the next section for the roadmap.
 ## Why this exists

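The confidence intervals quoted in the README leaderboard above are described as Wilson intervals. The sketch below is an independent reimplementation of the standard Wilson score formula, not the package's own `pdfhell.scorer.wilson_ci`; the names `wilson_interval` and `intervals_overlap` are hypothetical, and the n=0 convention simply mirrors what the tests in this diff expect.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (illustrative only)."""
    if n == 0:
        # No observations: every rate in [0, 1] is still plausible.
        return 0.0, 1.0
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to guard against floating-point drift just outside [0, 1].
    return max(0.0, centre - half), min(1.0, centre + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


lo, hi = wilson_interval(10, 10)
print(f"10/10 -> [{lo:.2f}, {hi:.2f}]")  # 10/10 -> [0.72, 1.00]

# 29/30 and 28/30 cannot be separated at n=30.
print(intervals_overlap(wilson_interval(29, 30), wilson_interval(28, 30)))  # True
```

Unlike the simpler Wald interval, the Wilson interval does not collapse to zero width at 0% or 100%, which is why a per-trap 10/10 still carries a lower bound near 72%.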
{pdfhell-0.1.0 → pdfhell-0.1.2}/pdfhell.egg-info/SOURCES.txt RENAMED

@@ -25,4 +25,5 @@ tests/test_auditpack.py
 tests/test_cli.py
 tests/test_generators.py
 tests/test_junit.py
-tests/test_scorer.py
+tests/test_scorer.py
+tests/test_statistical.py

{pdfhell-0.1.0 → pdfhell-0.1.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "pdfhell"
-version = "0.1.0"
+version = "0.1.2"
 description = "PDF Hell — adversarial PDFs that break AI document readers. Procedural ground truth, not LLM-as-judge."
 readme = "README.md"
 requires-python = ">=3.10"

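The suite-versioning tests in the file that follows require an 8-character `suite_hash` that is stable across calls and changes when the (trap_family, seed) contents change. This diff does not show how pdfhell derives that hash, so the sketch below is only one plausible construction: `suite_fingerprint`, the use of SHA-256, and the seed values are all assumptions, and the result is not expected to reproduce `8ad87b8d`.

```python
import hashlib
import json


def suite_fingerprint(traps: dict[str, list[int]], length: int = 8) -> str:
    """Short hex digest over the sorted (trap_family, seed) pairs (hypothetical)."""
    # Sorting makes the digest independent of dict and list ordering.
    pairs = sorted((family, seed) for family, seeds in traps.items() for seed in seeds)
    payload = json.dumps(pairs, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


# Placeholder seeds; the real mini-v1 seeds are not listed in this diff.
mini = {
    "hidden_ocr_mismatch": list(range(1, 11)),
    "footnote_override": list(range(1, 11)),
    "split_table_across_pages": list(range(1, 11)),
}

before = suite_fingerprint(mini)
after = suite_fingerprint({**mini, "new_trap_family": [9001]})
print(len(before), before != after)
```

Hashing the contents rather than trusting the version label alone means an accidental change to the seed list shows up as a different fingerprint even if nobody bumps `mini-v1`.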
pdfhell-0.1.2/tests/test_statistical.py ADDED

@@ -0,0 +1,151 @@
+"""Tests for the statistical-rigor additions: Wilson CIs + suite versioning.
+The professor-persona review of pdfhell flagged two methodology gaps:
+single-point pass rates without confidence intervals, and unversioned
+suites that mutate as we add trap families. These tests guard the
+fixes.
+"""
+from __future__ import annotations
+import pytest
+from pdfhell.case import HellCase
+from pdfhell.scorer import score_case, summarise, wilson_ci
+from pdfhell.suite import SUITES, mini_suite, smoke_suite
+# ─── Wilson CI math ────────────────────────────────────────────────────────
+def test_wilson_ci_perfect_score_small_n():
+    """10/10 passes — the CI lower bound is well below 1.0 (small-sample
+    uncertainty). This is the case that motivated adding CIs in the
+    first place — a per-trap 10/10 is not statistically distinguishable
+    from a true rate of 75%."""
+    lo, hi = wilson_ci(10, 10)
+    assert 0.65 < lo < 0.80, f"unexpected lower bound: {lo}"
+    assert hi == pytest.approx(1.0)
+def test_wilson_ci_zero_score_small_n():
+    """0/10 passes — symmetric to the 10/10 case. Upper bound is well
+    above 0.0."""
+    lo, hi = wilson_ci(0, 10)
+    assert lo == pytest.approx(0.0)
+    assert 0.20 < hi < 0.35
+def test_wilson_ci_thirty_case_ci_width():
+    """The mini-suite n=30 at 28/30 (93%) — Wilson CI width must be wide
+    enough that 28/30 vs 29/30 is NOT clearly separable. This guards
+    against accidentally tightening to a narrower interval (e.g. Wald)
+    that would mislead users."""
+    lo_28, hi_28 = wilson_ci(28, 30)
+    lo_29, hi_29 = wilson_ci(29, 30)
+    # The two intervals overlap substantially.
+    assert lo_29 < hi_28, "97% vs 93% CIs should overlap at n=30"
+def test_wilson_ci_empty_run_is_vacuous():
+    """n=0 → CI is the full [0, 1]. Don't crash on empty runs."""
+    lo, hi = wilson_ci(0, 0)
+    assert lo == 0.0
+    assert hi == 1.0
+def test_wilson_ci_z_parameter():
+    """99% CI is wider than 95% CI for the same data."""
+    lo95, hi95 = wilson_ci(7, 10)
+    lo99, hi99 = wilson_ci(7, 10, z=2.576)
+    assert lo99 < lo95
+    assert hi99 > hi95
+# ─── Suite versioning ──────────────────────────────────────────────────────
+def test_mini_suite_is_versioned():
+    spec = mini_suite()
+    assert spec.version == "mini-v1"
+    assert spec.suite_hash, "suite_hash must be set"
+    assert len(spec.suite_hash) == 8
+def test_smoke_suite_is_versioned():
+    spec = smoke_suite()
+    assert spec.version == "smoke-v1"
+    assert spec.suite_hash
+def test_suite_hash_is_deterministic():
+    """Same trap-seed contents → same hash."""
+    a = mini_suite().suite_hash
+    b = mini_suite().suite_hash
+    assert a == b
+def test_suite_hash_differs_with_different_seeds():
+    """Mutating the seeds changes the hash. Adding a new trap family
+    must not silently keep the same suite_hash."""
+    a = mini_suite()
+    b = mini_suite()
+    b.traps["new_trap_family"] = [9001]
+    assert a.suite_hash != b.suite_hash
+def test_suites_registered():
+    assert "mini" in SUITES
+    assert "smoke" in SUITES
+    assert SUITES["mini"].version == "mini-v1"
+# ─── SuiteReport CI integration ────────────────────────────────────────────
+def _make_case(expected: str) -> HellCase:
+    return HellCase(
+        id="t-0001",
+        trap_family="hidden_ocr_mismatch",
+        seed=1,
+        question="x?",
+        expected_answer=expected,
+    )
+def test_suite_report_carries_pass_rate_ci():
+    cases = [score_case(_make_case("$1.00"), "$1.00") for _ in range(10)]
+    cases += [score_case(_make_case("$2.00"), "wrong") for _ in range(5)]
+    report = summarise("test:model", "mini", cases)
+    lo, hi = report.pass_rate_ci
+    assert 0.0 <= lo <= report.pass_rate <= hi <= 1.0
+def test_suite_report_to_dict_includes_cis_and_version():
+    cases = [score_case(_make_case("$1.00"), "$1.00") for _ in range(3)]
+    report = summarise("test:model", "mini", cases)
+    report.suite_version = "mini-v1"
+    report.suite_hash = "deadbeef"
+    d = report.to_dict()
+    assert "pass_rate_ci" in d
+    assert "per_trap_pass_ci" in d
+    assert d["suite_version"] == "mini-v1"
+    assert d["suite_hash"] == "deadbeef"
+    # CIs are lists not tuples (JSON-friendly).
+    assert isinstance(d["pass_rate_ci"], list)
+    assert len(d["pass_rate_ci"]) == 2
+def test_per_trap_ci_uses_actual_case_counts():
+    """Per-trap CI must reflect the number of cases in that family.
+    Mixing families with different N counts shouldn't collapse to one
+    aggregate."""
+    cases = [
+        # 10 hidden_ocr passes
+        *[score_case(_make_case("$1.00"), "$1.00") for _ in range(10)],
+    ]
+    report = summarise("test:model", "mini", cases)
+    cis = report.per_trap_pass_ci
+    assert "hidden_ocr_mismatch" in cis
+    lo, hi = cis["hidden_ocr_mismatch"]
+    # 10/10 at n=10 → lower bound ~0.72
+    assert 0.65 < lo < 0.80