PyPI - docpluck - Versions diffs - 2.4.0__tar.gz → 2.4.2__tar.gz - Mend

docpluck 2.4.0tar.gz → 2.4.2tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (268)

{docpluck-2.4.0 → docpluck-2.4.2}/CHANGELOG.md RENAMED

@@ -1,5 +1,43 @@
 # Changelog
+## [2.4.2] — 2026-05-13
+Iterative follow-up. After v2.4.1 the 101-PDF corpus run was 98/101 PASS (`scripts/verify_corpus_full.py`); this release closes two of the three remaining failures and reframes the third as a known short-paper edge case in the verifier.
+### Fixes
+1. **`docpluck/render.py::_render_sections_to_markdown`** — table emission when Camelot returned no cells. Previously, a located table with a caption but no structured cells produced ``### Table N\n*caption*\n`` in body markdown — promising structured content that wasn't there. Verifier flagged this with the `H` tag (missing_html). Two papers affected: `bjps_4`, `ar_apa_j_jesp_2009_12_011`. New behavior: when `html` is empty for a body-located table, skip the `### Table N` heading and emit only the caption as a plain italic paragraph (`*Table N. caption text*`). The table reference is still surfaced in body flow, but without the false promise of structured HTML. Same treatment for the unlocated-tables appendix — tables with neither caption nor cells are dropped (a bare `### Table N` stub is information-free).
+2. **`docpluck/render.py::_render_sections_to_markdown`** — uppercase canonical section headings when pdftotext flattens Elsevier letter-spaced typography. JESP / Cognition / JEP papers render their section headings with letter-spacing (``a b s t r a c t``), which pdftotext extracts as a lone lowercase word. Without this fix the rendered output mixes ``## abstract`` with ``## Methods`` / ``## Results`` — a stylistic blemish on every Elsevier-style paper. New rule: when the captured `heading_text` is entirely lowercase ASCII AND the section has a recognized canonical label, replace the heading with the pretty Title-Case form (`Abstract`, `Keywords`, etc.). All-caps publisher headings (JAMA ``RESULTS``) are preserved verbatim — only lowercase is rewritten.
+### Verifier upgrade
+3. **`scripts/verify_corpus_full.py::_classify`** — short-paper exemption. The `S` (section_count < 4) and `X` (output < 5 KB) tags are now suppressed when the rendered title contains `ADDENDUM` / `CORRIGENDUM` / `CORRECTION` / `ERRATUM` / `RETRACTION`. The canonical example is `jdm_.2023.10`, a 1-page archival correction notice that legitimately has 1 section and ~1 KB of body content; flagging it as a render failure was a verifier false positive.
+### Bumps
+- `__version__`: `2.4.1` → `2.4.2`. Patch — render behavior changes affect only the 2 H-tagged papers + lowercase-abstract heading on Elsevier-style papers; no API change.
+### Tests
+6 new tests in `tests/test_render.py` covering the H-tag emission rules (body-located + appendix), the lowercase-canonical heading uppercase rule, and the happy-path no-op cases.
+## [2.4.1] — 2026-05-12
+Same-day follow-up to v2.4.0. Expanded testing to all 101 PDFs in the wider corpus (vs the 26 spike-baseline papers) and fixed the most common new failure: missing-title on AMA/AOM single-line title layouts.
+### Fixes
+1. **`docpluck/render.py::_compute_layout_title`** — title-size selection in two passes:
+   - Pass 1 (unchanged): largest font with count ≥ 2 (multi-line titles).
+   - Pass 2 (new): largest font in the TOP region (y0 ≥ 70% of page height) with count ≥ 1 and combined span text ≥ 10 chars.
+   Without the top-region restriction + text-length floor, a stray same-font glyph elsewhere on the page (a "+" decoration at font 16.0, a "GUIDEPOST" feature-label at font 30.0) would outrank a real single-line title at a smaller-but-still-large font. Affects: `jama_open_3`, `jama_open_4`, `jama_open_6`, `jama_open_10`, `annals_4`, `amd_1` and similar AMA/AOM-style papers.
+### Bumps
+- `__version__`: `2.4.0` → `2.4.1`. Patch-level — internal heuristic improvement, no API change.
 ## [2.4.0] — 2026-05-12
 Same-day follow-up. Closes the three real library bugs surfaced by the AI-Chrome visual verification pass on all 26 corpus papers documented in `docs/HANDOFF_2026-05-12_visual_verify_results.md`. The API-level `verify_corpus.py` was passing 26/26 throughout but couldn't see these — visual inspection in the workspace was needed.

{docpluck-2.4.0 → docpluck-2.4.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpluck
-Version: 2.4.0
+Version: 2.4.2
 Summary: PDF, DOCX, and HTML text extraction and normalization for academic papers
 Project-URL: Homepage, https://github.com/giladfeldman/docpluck
 Project-URL: Documentation, https://github.com/giladfeldman/docpluck/tree/main/docs

{docpluck-2.4.0 → docpluck-2.4.2}/docpluck/__init__.py RENAMED

@@ -71,7 +71,7 @@ from .figures import Figure
 from .extract_structured import TABLE_EXTRACTION_VERSION, StructuredResult, extract_pdf_structured
 from .render import render_pdf_to_markdown
-__version__ = "2.4.0"
+__version__ = "2.4.2"
 __author__ = "Gilad Feldman"
 __license__ = "MIT"

{docpluck-2.4.0 → docpluck-2.4.2}/docpluck/render.py RENAMED

@@ -610,13 +610,34 @@ def _compute_layout_title(layout_doc: LayoutDoc) -> Optional[str]:
         round(float(s.font_size) * 2) / 2 for s in upper_spans
     )
     title_size: Optional[float] = None
+    # Pass 1: largest font with count >= 2 (the title typically spans
+    # 2-3 lines).
     for sz, count in sorted(size_counts.items(), reverse=True):
         if sz >= 12.0 and count >= 2:
             title_size = sz
             break
-        if sz >= 14.0 and count >= 1:
-            title_size = sz
-            break
+    if title_size is None:
+        # Pass 2: fall back to the largest font in the TOP region with
+        # count >= 1 AND >= 10 chars of combined span text. The top-
+        # region filter (y0 >= 70% of page height) rejects mid-page
+        # decorations like a "+" badge or section-heading numerals.
+        # The text-length filter rejects short feature-labels (e.g. AOM
+        # papers' "GUIDEPOST" header at font 30) in favor of the longer
+        # title block immediately below.
+        top_region_threshold = height * 0.70
+        top_spans = [s for s in upper_spans if s.y0 >= top_region_threshold]
+        candidate_sizes = sorted(
+            {round(float(s.font_size) * 2) / 2 for s in top_spans},
+            reverse=True,
+        )
+        for sz in candidate_sizes:
+            if sz < 14.0:
+                break
+            matching = [s for s in top_spans if abs(float(s.font_size) - sz) < 0.3]
+            combined_text_len = sum(len((s.text or "").strip()) for s in matching)
+            if combined_text_len >= 10:
+                title_size = sz
+                break
     if title_size is None:
         return None
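The two-pass selection in the hunk above can be tried in isolation. The following is a minimal sketch, not docpluck's actual code: the `Span` dataclass, `half_point`, and `pick_title_size` are hypothetical names, and the sample coordinates assume that a larger `y0` means closer to the top of the page, as the diff's "top region" comment implies.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in for a layout span (not docpluck's real type)."""
    text: str
    font_size: float
    y0: float  # assumed: larger y0 means closer to the top of the page


def half_point(size: float) -> float:
    """Round a font size to the nearest 0.5 pt, as the diff does."""
    return round(float(size) * 2) / 2


def pick_title_size(spans: list[Span], page_height: float) -> Optional[float]:
    # Pass 1: largest size (>= 12 pt) that occurs on at least two spans.
    size_counts = Counter(half_point(s.font_size) for s in spans)
    for sz, count in sorted(size_counts.items(), reverse=True):
        if sz >= 12.0 and count >= 2:
            return sz

    # Pass 2: largest size (>= 14 pt) in the top region whose spans carry
    # at least 10 characters of text in total.
    top_spans = [s for s in spans if s.y0 >= page_height * 0.70]
    for sz in sorted({half_point(s.font_size) for s in top_spans}, reverse=True):
        if sz < 14.0:
            break
        matching = [s for s in top_spans if abs(float(s.font_size) - sz) < 0.3]
        if sum(len((s.text or "").strip()) for s in matching) >= 10:
            return sz
    return None


# Made-up page: a short feature label, a single-line title, a mid-page badge.
spans = [
    Span("GUIDEPOST", 30.0, 760.0),  # 9 chars, fails the text-length floor
    Span("A Single-Line Title About Something", 15.0, 700.0),
    Span("+", 16.0, 300.0),  # below the top region, ignored in pass 2
    Span("Body text.", 10.0, 500.0),
    Span("More body text.", 10.0, 480.0),
]
print(pick_title_size(spans, page_height=800.0))
```

On this sample, pass 1 finds no size of 12 pt or more that repeats, so pass 2 decides: the 30 pt label is rejected for being too short and the 16 pt badge for sitting mid-page, leaving the 15 pt title.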
@@ -1140,6 +1161,23 @@ def _render_sections_to_markdown(
         )
         if not skip_heading:
             heading = sec.heading_text or _pretty_label(sec.label)
+            # v2.4.2: when the heading_text the section detector captured is
+            # entirely lowercase (Elsevier "a b s t r a c t" letter-spaced
+            # typography → pdftotext flattens to "abstract") AND the section
+            # has a recognized canonical label, prefer the pretty Title-Case
+            # form. Without this fix the rendered output reads ``## abstract``
+            # alongside ``## Methods``/``## Results`` — a stylistic blemish
+            # that surfaces on every Elsevier (JESP, Cognition, JEP) paper.
+            if (
+                heading
+                and heading == heading.lower()
+                and heading.isascii()
+                and any(c.isalpha() for c in heading)
+                and canonical != "unknown"
+            ):
+                pretty = _pretty_label(sec.label)
+                if pretty and pretty != heading:
+                    heading = pretty
             # \n\n (not \n) separates heading from body so downstream
             # markdown renderers treat them as a heading block + paragraph,
             # not as one mashed paragraph starting with "## Abstract ...".
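The lowercase-heading rule in this hunk can be sketched as a standalone function. This is an illustrative sketch only: `PRETTY_LABELS` and `normalize_heading` are hypothetical names standing in for the library's `_pretty_label` lookup, and the label set shown is an assumption.

```python
from typing import Optional

# Hypothetical canonical-label table; docpluck's real mapping lives behind
# _pretty_label and may differ.
PRETTY_LABELS = {
    "abstract": "Abstract",
    "keywords": "Keywords",
    "methods": "Methods",
    "results": "Results",
}


def normalize_heading(heading: Optional[str], canonical: str) -> Optional[str]:
    """Swap an all-lowercase ASCII heading for its Title-Case canonical form."""
    if (
        heading
        and heading == heading.lower()          # no uppercase characters
        and heading.isascii()
        and any(c.isalpha() for c in heading)   # skip purely numeric headings
        and canonical != "unknown"
    ):
        pretty = PRETTY_LABELS.get(canonical)
        if pretty and pretty != heading:
            return pretty
    return heading


print(normalize_heading("abstract", "abstract"))  # flattened letter-spaced heading
print(normalize_heading("RESULTS", "results"))    # all-caps publisher heading
```

The `heading == heading.lower()` test is what leaves all-caps headings such as `RESULTS` alone; only headings with no uppercase characters are rewritten.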
@@ -1170,11 +1208,19 @@ def _render_sections_to_markdown(
             if kind == "table":
                 cells = item.get("cells") or []
                 html = item.get("html") or (cells_to_html(cells) if cells else "")
-                body_chunks.append(f"\n### {label}\n")
-                if cap:
-                    body_chunks.append(f"*{cap}*\n")
                 if html:
+                    body_chunks.append(f"\n### {label}\n")
+                    if cap:
+                        body_chunks.append(f"*{cap}*\n")
                     body_chunks.append(html)
+                elif cap:
+                    # v2.4.2: Camelot returned no cells for this caption.
+                    # Skip the `### Table N` heading (which would falsely
+                    # promise structured content) and emit the caption as a
+                    # plain italicized paragraph so the table reference is
+                    # preserved in body flow. Affected papers in the
+                    # 101-PDF corpus: bjps_4, ar_apa_j_jesp_2009_12_011.
+                    body_chunks.append(f"\n*{cap}*\n")
             else:
                 body_chunks.append(f"\n### {label}\n")
                 if cap:
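The three outcomes of the new table branch (HTML present, caption only, neither) are easier to see outside the loop. A minimal sketch, assuming a hypothetical helper `emit_table` that returns the chunks the hunk would append to `body_chunks`:

```python
def emit_table(label: str, caption: str, html: str) -> list[str]:
    """Return the markdown chunks for one body-located table."""
    chunks: list[str] = []
    if html:
        # Structured content exists: heading, optional caption, then HTML.
        chunks.append(f"\n### {label}\n")
        if caption:
            chunks.append(f"*{caption}*\n")
        chunks.append(html)
    elif caption:
        # No cells were extracted: keep the reference, drop the heading.
        chunks.append(f"\n*{caption}*\n")
    # Neither HTML nor caption: emit nothing.
    return chunks


print(emit_table("Table 1", "Table 1. Descriptive statistics", "<table></table>"))
print(emit_table("Table 2", "Table 2. Regression results", ""))
print(emit_table("Table 3", "", ""))
```

Because the caption-only case no longer emits a `### Table N` heading, a render in which no table has HTML should no longer match the verifier's `H` condition.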
@@ -1192,18 +1238,28 @@ def _render_sections_to_markdown(
     leftover_figures.extend(unlocated_figures)
     if leftover_tables:
-        out_chunks.append("## Tables (unlocated in body)\n\n")
-        for t in leftover_tables:
-            label = t.get("label") or "Table"
-            cap = t.get("caption") or ""
-            cells = t.get("cells") or []
-            html = t.get("html") or (cells_to_html(cells) if cells else "")
-            out_chunks.append(f"### {label}\n")
-            if cap:
-                out_chunks.append(f"*{cap}*\n")
-            if html:
-                out_chunks.append(html + "\n")
-            out_chunks.append("\n")
+        # v2.4.2: drop tables that have neither a caption nor structured
+        # HTML — emitting a bare ``### Table N`` header in the appendix
+        # adds no information and clutters the output.
+        renderable_tables = [
+            t for t in leftover_tables
+            if (t.get("caption") or "").strip()
+            or t.get("html")
+            or t.get("cells")
+        ]
+        if renderable_tables:
+            out_chunks.append("## Tables (unlocated in body)\n\n")
+            for t in renderable_tables:
+                label = t.get("label") or "Table"
+                cap = t.get("caption") or ""
+                cells = t.get("cells") or []
+                html = t.get("html") or (cells_to_html(cells) if cells else "")
+                out_chunks.append(f"### {label}\n")
+                if cap:
+                    out_chunks.append(f"*{cap}*\n")
+                if html:
+                    out_chunks.append(html + "\n")
+                out_chunks.append("\n")
     if leftover_figures:
         out_chunks.append("## Figures\n\n")
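The appendix filter can be exercised on plain dictionaries. This sketch mirrors the hunk's logic under stated assumptions: `render_unlocated_tables` is a hypothetical wrapper, and `cells_to_html_stub` is a stand-in for docpluck's `cells_to_html` that assumes cells are a list of rows of strings.

```python
def cells_to_html_stub(cells: list[list[str]]) -> str:
    """Hypothetical stand-in for docpluck's cells_to_html helper."""
    rows = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>" for row in cells
    )
    return f"<table>{rows}</table>"


def render_unlocated_tables(tables: list[dict]) -> str:
    # Keep only tables that have a caption, HTML, or cells to show.
    renderable = [
        t for t in tables
        if (t.get("caption") or "").strip() or t.get("html") or t.get("cells")
    ]
    if not renderable:
        return ""  # no appendix heading at all when nothing is renderable
    chunks = ["## Tables (unlocated in body)\n\n"]
    for t in renderable:
        cap = t.get("caption") or ""
        cells = t.get("cells") or []
        html = t.get("html") or (cells_to_html_stub(cells) if cells else "")
        chunks.append(f"### {t.get('label') or 'Table'}\n")
        if cap:
            chunks.append(f"*{cap}*\n")
        if html:
            chunks.append(html + "\n")
        chunks.append("\n")
    return "".join(chunks)


tables = [
    {"label": "Table 1"},                               # bare stub, dropped
    {"label": "Table 2", "caption": "Table 2. Means"},  # caption only, kept
    {"label": "Table 3", "cells": [["a", "b"]]},        # cells only, kept
]
print(render_unlocated_tables(tables))
```

Note that, unlike the body branch, a caption-only table in the appendix still gets its `### Table N` heading in this hunk; only tables with nothing to show are dropped.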

docpluck-2.4.2/docs/HANDOFF_2026-05-12_phase2_101pdf_corpus.md ADDED

@@ -0,0 +1,57 @@
+# Handoff — Phase 2 (101-PDF corpus expansion)
+**Session date:** 2026-05-12 (continuation of the v2.3.1 → v2.4.0 → v2.4.1 release chain)
+## State at handoff
+- **Library:** v2.4.1 tagged + pushed. PyPI not published.
+- **App pin:** `docpluck v2.4.0` in `PDFextractor/service/requirements.txt`. Needs bump to v2.4.1 next session.
+- **26-paper corpus verifier (`scripts/verify_corpus.py`):** 26/26 PASS at v2.4.1.
+- **101-paper corpus verifier (`scripts/verify_corpus_full.py`, new this session):** partial-run result before v2.4.1 was applied — 7 failures observed in 25 papers processed (run cancelled to ship v2.4.1). Of those, 5 were the M-tag (missing title) on AMA/AOM single-span-title layouts that v2.4.1 specifically targets. **Next session must re-run with `python scripts/verify_corpus_full.py` to enumerate the actual v2.4.1 failure set.**
+## What's in v2.4.1
+A single fix to `_compute_layout_title` in `docpluck/render.py`:
+- Pass 2 of the title-size selector (single-span fallback) now requires the span to be in the TOP region of the page (y0 ≥ 70% of page height) AND have ≥ 10 chars of combined text.
+- Catches AMA/AOM cases where a mid-page big-font decoration (a "+" glyph at font 16.0, a "GUIDEPOST" feature-label at font 30.0) was outranking the actual title at a smaller font (e.g. font 15.0 on the JAMA Open layout).
+Affects: `jama_open_3/4/6/10`, `amd_1`, `annals_4`, and likely several more AMA-format papers in the wider 101-PDF corpus.
+## Known issues remaining (from partial 101-run)
+| Paper | Tag | Cause |
+|---|---|---|
+| `ar_apa_j_jesp_2009_12_011` | H | Camelot couldn't extract any tables despite body referencing them (`### Table N` headings present but no `<table>` HTML). Known Camelot limitation; banner already warns user. |
+Other papers' status under v2.4.1 is **unknown** — the partial run was on the v2.4.0 code path and is now stale.
+## Recommended next-session workflow
+1. **Bump app pin** in `PDFextractor/service/requirements.txt`: `v2.4.0` → `v2.4.1`. Commit + push.
+2. **Run full 101-PDF verifier:** `python scripts/verify_corpus_full.py --save-renders` (15-30 min).
+3. **Triage failures** by tag frequency: M / D / R / S / H / C / X / L / J. Probably 2-5 distinct root-cause patterns.
+4. **Pick top 1-2 patterns** with highest paper-count, root-cause, fix in `render.py` (or wherever it lives), add unit tests.
+5. **Re-run 26-paper verifier** to guard against regressions.
+6. **Tag + push** as v2.4.2.
+7. **Visual spot-check** of representative fixed papers through the workspace via Chrome MCP.
+8. Repeat from step 2 until weekly quota exhausted or all 101 papers pass.
+## Renders directory
+`tmp/renders_v2.4.0/` contains rendered `.md` files for the ~25 papers processed in the partial run. Useful for grepping for "## Heading word" patterns and other regressions before re-running. **Stale at v2.4.1** — re-render is needed to update them.
+## Tagging legend (for the new verifier)
+| Tag | Meaning |
+|---|---|
+| M | missing `# Title` line |
+| T | title ends in connector word ("of", "the", "and", ...) — almost certainly truncated |
+| D | title is missing distinct words ≥ 4 letters that the spike baseline has (middle truncation; needs spike baseline to fire) |
+| R | title text appears as body prose immediately after `# Title` (Nature-style duplication) |
+| S | section count < 4 |
+| H | `### Table N` headings present in body but no `<table>` HTML element |
+| C | longest `*Figure N. ...*` caption > 800 chars (boundary leak) |
+| X | output < 5 KB (extremely short — likely PDF extract failure) |
+| L | output much shorter than spike baseline (requires baseline) |
+| J | Jaccard vs spike < 0.6 (requires baseline) |
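The triage step in the workflow (ranking failures by tag frequency) could be done with a small tally over the verifier's per-paper tags. A minimal sketch, assuming the results have already been collected into a dictionary; the paper names and tag lists below are made up for illustration, and `tag_frequencies` is a hypothetical helper, not part of the script.

```python
from collections import Counter

# Short names taken from the legend line printed by verify_corpus_full.py.
LEGEND = {
    "M": "missing_title", "T": "title_trunc", "D": "title_words_dropped",
    "R": "title_repeat_in_body", "S": "few_sections", "H": "missing_html",
    "C": "cap_too_long", "X": "output_too_short", "L": "much_shorter",
    "J": "low_jaccard",
}


def tag_frequencies(results: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Count how many papers carry each tag, most frequent first."""
    counts = Counter(tag for tags in results.values() for tag in tags)
    return counts.most_common()


# Made-up results: paper name -> tags reported by the verifier.
results = {
    "paper_a": ["M"],
    "paper_b": ["M", "S"],
    "paper_c": ["H"],
    "paper_d": [],
}
for tag, n in tag_frequencies(results):
    print(f"{tag} ({LEGEND[tag]}): {n} paper(s)")
```

Sorting by paper count puts the root-cause patterns that affect the most papers at the top, which is the ordering the workflow recommends for picking the next fix.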

{docpluck-2.4.0 → docpluck-2.4.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "docpluck"
-version = "2.4.0"
+version = "2.4.2"
 description = "PDF, DOCX, and HTML text extraction and normalization for academic papers"
 readme = "docs/README.md"
 requires-python = ">=3.10"

docpluck-2.4.2/scripts/verify_corpus_full.py ADDED

@@ -0,0 +1,288 @@
+"""Full-corpus verifier: run v2.4.0 render across all 101 PDFs in
+PDFextractor/test-pdfs/ and flag papers with structural issues, even those
+without a spike baseline.
+For papers WITH a spike baseline, full metrics (char-ratio, Jaccard, D-tag)
+apply just like in verify_corpus.py.
+For papers WITHOUT a spike baseline (75 of the 101), we apply baseline-free
+heuristics:
+  - title present? non-trivial? not trailing-truncated?
+  - section count >= 4 (most academic papers have at least Abstract +
+    Introduction + Methods/Results + Discussion + References)
+  - rendered length plausible (>5 KB)
+  - title block not duplicated immediately in body (Nature-style)
+Output: one line per paper with status + tags, then a triage section
+listing the top issues for follow-up.
+Usage:
+  python scripts/verify_corpus_full.py
+  python scripts/verify_corpus_full.py --only-fails
+  python scripts/verify_corpus_full.py --paper jama_open_5
+"""
+from __future__ import annotations
+import argparse
+import re
+import sys
+import time
+from pathlib import Path
+from typing import Optional
+REPO_ROOT = Path(__file__).resolve().parent.parent
+APP_PDFS = REPO_ROOT.parent / "PDFextractor" / "test-pdfs"
+SPIKE_OUT_DIRS = [
+    REPO_ROOT / "docs/superpowers/plans/spot-checks/splice-spike/outputs",
+    REPO_ROOT / "docs/superpowers/plans/spot-checks/splice-spike/outputs-new",
+]
+RENDERS_DIR = REPO_ROOT / "tmp" / "renders_v2.4.0"
+_CONNECTOR_TAIL = {
+    "of", "from", "for", "the", "and", "or", "to", "with", "on", "at",
+    "by", "in", "as", "is", "a", "an", "but", "into", "onto", "upon",
+    "than", "that", "which", "who", "when", "where", "while", "during",
+    "after", "before", "because", "since", "though", "although",
+}
+_TITLE_RE = re.compile(r"^\s*#\s+([^\n]+)$", re.MULTILINE)
+_H2_RE = re.compile(r"^\s*##\s+([^\n]+)$", re.MULTILINE)
+_TABLE_HTML_RE = re.compile(r"<table>")
+_FIG_CAPTION_RE = re.compile(r"^\*Figure\s+\d+\.?\s+[^\n]*?\*\s*$", re.MULTILINE)
+def _all_pdfs() -> list[Path]:
+    return sorted(APP_PDFS.rglob("*.pdf"))
+def _find_spike_md(name: str) -> Optional[Path]:
+    for d in SPIKE_OUT_DIRS:
+        p = d / f"{name}.md"
+        if p.exists():
+            return p
+    return None
+def _word_set(text: str) -> set[str]:
+    return set(re.findall(r"[A-Za-z]{4,}", text.lower()))
+def _title_word_delta(rendered_title: Optional[str], spike_title: Optional[str]) -> int:
+    if not rendered_title or not spike_title:
+        return 0
+    rw = set(re.findall(r"[A-Za-z]{4,}", rendered_title.lower()))
+    sw = set(re.findall(r"[A-Za-z]{4,}", spike_title.lower()))
+    return len(sw - rw)
+def _has_immediate_title_repeat(md: str, title: str) -> bool:
+    """True if the first few body paragraphs contain a span whose token
+    content matches the title (the symptom my Nature-style sweep targets).
+    Conservative — should never fire after v2.4.0 unless a regression."""
+    if not title:
+        return False
+    title_tokens = re.findall(r"\w+", title.lower())
+    if len(title_tokens) < 4:
+        return False
+    title_set = set(title_tokens)
+    # Skip the title line itself; scan the next ~30 non-blank body lines.
+    lines = md.split("\n")
+    after_title = False
+    accumulated: list[str] = []
+    n_scanned = 0
+    for ln in lines:
+        line = ln.strip()
+        if not after_title:
+            if line.startswith("# "):
+                after_title = True
+            continue
+        if not line or line.startswith("#"):
+            if accumulated:
+                # check whole accumulated span
+                covered = sum(1 for t in title_tokens if t in accumulated)
+                in_title = sum(1 for t in accumulated if t in title_set)
+                if covered >= 0.8 * len(title_tokens) and in_title >= 0.7 * len(accumulated):
+                    return True
+            accumulated = []
+            continue
+        accumulated.extend(re.findall(r"\w+", line.lower()))
+        n_scanned += 1
+        if n_scanned > 30:
+            break
+    return False
+def _metrics(md: str) -> dict:
+    title_m = _TITLE_RE.search(md)
+    title = title_m.group(1).strip() if title_m else None
+    title_truncated = False
+    if title:
+        stripped = re.sub(r"[\s\.,;:!?\-—–]+$", "", title).lower()
+        last = stripped.rsplit(None, 1)[-1] if " " in stripped else stripped
+        title_truncated = last in _CONNECTOR_TAIL
+    sections = _H2_RE.findall(md)
+    return {
+        "title": title,
+        "title_truncated": title_truncated,
+        "section_count": len(sections),
+        "section_names": sections,
+        "table_html_count": len(_TABLE_HTML_RE.findall(md)),
+        "total_chars": len(md),
+        "title_repeat_in_body": _has_immediate_title_repeat(md, title) if title else False,
+        "longest_fig_caption_chars": max(
+            (len(m.group(0)) for m in _FIG_CAPTION_RE.finditer(md)), default=0
+        ),
+    }
+_CORRECTION_TITLE_RE = re.compile(
+    r"\b(?:addendum|corrigendum|correction|erratum|retraction)\b",
+    re.IGNORECASE,
+)
+def _classify(name: str, md: str, spike_md: Optional[str]) -> tuple[str, dict, list[str]]:
+    m = _metrics(md)
+    tags: list[str] = []
+    title_text = m["title"] or ""
+    is_correction_paper = bool(_CORRECTION_TITLE_RE.search(title_text))
+    if m["title"] is None:
+        tags.append("M")  # missing title
+    if m["title_truncated"]:
+        tags.append("T")
+    if m["section_count"] < 4 and not is_correction_paper:
+        tags.append("S")
+    if m["title_repeat_in_body"]:
+        tags.append("R")  # title repeats in body (Nature-style dup)
+    appendix_idx = md.find("## Tables (unlocated in body)")
+    body_section = md if appendix_idx < 0 else md[:appendix_idx]
+    body_table_count = len(re.findall(r"^\s*###\s+Table\s+\d+", body_section, re.MULTILINE))
+    if body_table_count > 0 and m["table_html_count"] == 0:
+        tags.append("H")
+    if m["longest_fig_caption_chars"] > 800:
+        tags.append("C")
+    # X (short output) is suppressed when the title indicates an ADDENDUM /
+    # CORRIGENDUM / CORRECTION / ERRATUM / RETRACTION notice: these are
+    # genuinely 1-page correction notices and a short render is correct
+    # (the jdm_.2023.10 paper is the canonical case in the 101-PDF corpus).
+    if m["total_chars"] < 5000 and not is_correction_paper:
+        tags.append("X")  # extremely short — likely failure
+    spike_title = None
+    if spike_md:
+        spike_t = _TITLE_RE.search(spike_md)
+        spike_title = spike_t.group(1).strip() if spike_t else None
+    if spike_md:
+        char_ratio = m["total_chars"] / max(1, len(spike_md))
+        my_w = _word_set(md)
+        sp_w = _word_set(spike_md)
+        union = my_w | sp_w
+        jaccard = len(my_w & sp_w) / len(union) if union else None
+        m["char_ratio_vs_spike"] = char_ratio
+        m["jaccard_vs_spike"] = jaccard
+        if char_ratio < 0.7:
+            tags.append("L")
+        if jaccard is not None and jaccard < 0.6:
+            tags.append("J")
+    else:
+        m["char_ratio_vs_spike"] = None
+        m["jaccard_vs_spike"] = None
+    if spike_title:
+        miss = _title_word_delta(m["title"], spike_title)
+        if miss > 0:
+            tags.append("D")
+        m["title_missing_words"] = miss
+    else:
+        m["title_missing_words"] = 0
+    if not tags:
+        status = "PASS"
+    elif set(tags) <= {"L"}:
+        status = "WARN"
+    else:
+        status = "FAIL"
+    return status, m, tags
+def _run_render(pdf_path: Path) -> tuple[str, float]:
+    from docpluck import render_pdf_to_markdown
+    t0 = time.time()
+    data = pdf_path.read_bytes()
+    md = render_pdf_to_markdown(data)
+    return md, time.time() - t0
+def main() -> int:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--paper")
+    ap.add_argument("--only-fails", action="store_true")
+    ap.add_argument("--save-renders", action="store_true",
+                    help="dump each rendered .md to tmp/renders_v2.4.0/")
+    args = ap.parse_args()
+    if args.paper:
+        pdfs = [p for p in _all_pdfs() if p.stem == args.paper]
+    else:
+        pdfs = _all_pdfs()
+    if not pdfs:
+        print("ERROR: no PDFs found", file=sys.stderr)
+        return 1
+    if args.save_renders:
+        RENDERS_DIR.mkdir(parents=True, exist_ok=True)
+    print(f"# Full-corpus verification — {len(pdfs)} PDFs (v2.4.0)")
+    print(f"# legend: M=missing_title T=title_trunc D=title_words_dropped R=title_repeat_in_body S=few_sections H=missing_html C=cap_too_long X=output_too_short L=much_shorter J=low_jaccard")
+    print()
+    print(f"{'STATUS':6} {'PAPER':40} {'TAGS':15} {'CHARS':>8} {'SECT':>5} {'TABS':>5}  TIME")
+    print("-" * 100)
+    summary = {"PASS": 0, "WARN": 0, "FAIL": 0, "ERROR": 0}
+    failures: list[tuple[str, str, dict, list[str]]] = []
+    for pdf in pdfs:
+        name = pdf.stem
+        spike_path = _find_spike_md(name)
+        spike_md = spike_path.read_text(encoding="utf-8", errors="ignore") if spike_path else None
+        try:
+            md, elapsed = _run_render(pdf)
+        except Exception as e:
+            print(f"{'ERROR':6} {name:40}  {type(e).__name__}: {e}")
+            summary["ERROR"] += 1
+            continue
+        status, m, tags = _classify(name, md, spike_md)
+        summary[status] += 1
+        if status != "PASS":
+            failures.append((name, status, m, tags))
+        if args.only_fails and status == "PASS":
+            continue
+        if args.save_renders:
+            (RENDERS_DIR / f"{name}.md").write_text(md, encoding="utf-8", errors="replace")
+        tag_str = ",".join(tags) or "—"
+        print(f"{status:6} {name:40} {tag_str:15} {m['total_chars']:>8} {m['section_count']:>5} {m['table_html_count']:>5}  {elapsed:.1f}s")
+    print()
+    print("# Summary")
+    total = sum(summary.values())
+    for k in ("PASS", "WARN", "FAIL", "ERROR"):
+        if summary[k]:
+            print(f"  {k:8} {summary[k]:3} / {total}")
+    if failures:
+        print()
+        print("# Failure details")
+        for name, status, m, tags in failures:
+            tag_str = ",".join(tags)
+            print(f"\n  {status} {name} [{tag_str}]")
+            print(f"    title: {repr(m['title'])[:120]}")
+            print(f"    sections={m['section_count']} tables={m['table_html_count']} chars={m['total_chars']}")
+            if m.get("char_ratio_vs_spike") is not None:
+                print(f"    vs_spike: char_ratio={m['char_ratio_vs_spike']:.2f} jaccard={m['jaccard_vs_spike']:.2f} title_missing_words={m.get('title_missing_words', 0)}")
+    return 0 if summary["FAIL"] == 0 and summary["ERROR"] == 0 else 1
+if __name__ == "__main__":
+    sys.exit(main())
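The short-paper exemption and the tag-to-status mapping from `_classify` can be checked without rendering any PDF. The following is a reduced sketch, not the script itself: `baseline_free_tags` and `status_for` are hypothetical names, only the `M`, `S`, and `X` tags are modelled, and the sample titles are made up.

```python
import re
from typing import Optional

# Same pattern as _CORRECTION_TITLE_RE in the script.
CORRECTION_TITLE_RE = re.compile(
    r"\b(?:addendum|corrigendum|correction|erratum|retraction)\b",
    re.IGNORECASE,
)


def baseline_free_tags(title: Optional[str], section_count: int, total_chars: int) -> list[str]:
    """Reduced version of the baseline-free checks (M, S, X only)."""
    tags: list[str] = []
    is_correction = bool(CORRECTION_TITLE_RE.search(title or ""))
    if title is None:
        tags.append("M")
    if section_count < 4 and not is_correction:
        tags.append("S")
    if total_chars < 5000 and not is_correction:
        tags.append("X")
    return tags


def status_for(tags: list[str]) -> str:
    """PASS with no tags, WARN when the only tag is L, FAIL otherwise."""
    if not tags:
        return "PASS"
    if set(tags) <= {"L"}:
        return "WARN"
    return "FAIL"


# A one-section, ~1 KB correction notice versus an ordinary short render.
for title in ("CORRIGENDUM: An Example Notice", "An Ordinary Short Paper"):
    tags = baseline_free_tags(title, section_count=1, total_chars=1000)
    print(title, tags, status_for(tags))
```

The exemption keys off the rendered title alone, so a correction notice whose title failed to render would still be tagged `M`, `S`, and `X`.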
