PyPI - docpluck - Version diff: docpluck-2.4.2.tar.gz → docpluck-2.4.4.tar.gz


Files changed (269)

{docpluck-2.4.2 → docpluck-2.4.4}/CHANGELOG.md RENAMED

@@ -1,5 +1,44 @@
 # Changelog
+## [2.4.4] — 2026-05-13
+Bug fix on v2.4.3's caption-trim feature + extension to a second chart-data signature.
+### Bug fix
+1. **`docpluck/extract_structured.py::_extract_caption_text`** — v2.4.3's `_trim_caption_at_chart_data` was added to `docpluck/figures/detect.py::_full_caption_text`, but the live render pipeline never calls that function — figure captions are built in `extract_structured.py::_extract_caption_text` (which `_figure_from_caption` calls). v2.4.3's caption-trim was therefore a no-op on real renders despite its tests passing in isolation. v2.4.4 applies the trim to `_extract_caption_text` for `kind == "figure"` captions, so the trim actually fires during `render_pdf_to_markdown(pdf_bytes)`. Verified by manual render of `jama_open_6` (caption 400 chars → 47 chars) and `jama_open_3` (405 → 208 chars).
+### Enhancement
+2. **`docpluck/extract_structured.py::_trim_caption_at_chart_data`** — extended with a second chart-data signature: a run of 5+ short (1–4 digit) numeric tokens separated only by whitespace. Catches axis-tick label sequences (``0 5 10 15 20``) and stacked column values (``340 321 280 5 270``) that the 6-digit-run rule didn't see on charts with small-magnitude data. The two signatures are evaluated jointly; the earlier match in the caption wins so the caption is trimmed at the start of the chart data, not partway through it. Same conservative gates as before (caption ≥ 150 chars, surviving text ≥ 40 chars). Affects most JAMA Network Open Kaplan-Meier and Sci Rep / BMC clinical-trial papers — caption length drops from 400-char hard cap to ~150 chars of real prose.
+### Bumps
+- `__version__`: `2.4.3` → `2.4.4`. Patch — figure-caption truncation is now real and broader.
+### Tests
+3 new tests in `tests/test_figure_detect.py` (tick-run truncation, prose-with-inline-numbers no-op, earlier-of-two-signatures priority).
+## [2.4.3] — 2026-05-13
+Same-day follow-up. Two preventative improvements aimed at quality issues that didn't trip the verifier tags but were visible in rendered output:
+### Fixes
+1. **`docpluck/normalize.py::normalize_text` S9 step** — strip 4-digit standalone page numbers from continuous-pagination journals (PSPB volume runs into the 1000s, Psychological Science, etc.). Previously S9 only handled 1–3 digit page numbers; a bare `1174` line leaked into rendered output (e.g. `efendic_2022_affect.md` line 24). New rule strips 4-digit standalone numbers when (a) value is in 1000–9999, (b) same value recurs ≥ 3 times in the document. The recurrence floor protects table-cell values that happen to land on their own line in single-value-per-line column layouts. `NORMALIZATION_VERSION`: `1.8.1` → `1.8.2`.
+2. **`docpluck/figures/detect.py::_full_caption_text`** — truncate figure captions at chart-data boundaries. pdftotext extracts chart elements (axis labels, gridline values, legend entries) inline with the figure caption when they share a PDF reading-order paragraph. The resulting caption text looks like `Figure 1. Flowchart of Study Sample Selection 4876956 Pairs enrolled before April 1, 2015 1117269 Pairs excluded ...` — useful prose followed by raw chart data. New heuristic: locate the first run of 6+ consecutive digits (signature of chart data — page counts, n-values, and years all top out at 5 digits in academic captions) and truncate just before it at the previous word boundary. Conservative: only fires when caption is ≥ 150 chars and surviving trimmed text is ≥ 40 chars (sanity check protects against edge cases). Affects clinical / biological flowcharts in JAMA, Sci Rep, BMC Medicine papers.
+### Bumps
+- `__version__`: `2.4.2` → `2.4.3`. Patch — both fixes are conservative pdftotext post-processing.
+- `NORMALIZATION_VERSION`: `1.8.1` → `1.8.2`.
+### Tests
+7 new tests across `tests/test_normalization.py` (4-digit page number stripping, recurrence floor, year edge case) and `tests/test_figure_detect.py` (caption truncation at digit-run boundary, short-caption no-op, legitimate 5-digit-number preservation, minimum-post-label sanity check).
 ## [2.4.2] — 2026-05-13
 Iterative follow-up. After v2.4.1 the 101-PDF corpus run was 98/101 PASS (`scripts/verify_corpus_full.py`); this release closes two of the three remaining failures and reframes the third as a known short-paper edge case in the verifier.
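
The two chart-data signatures described in the 2.4.3 and 2.4.4 entries can be tried in isolation. The following is a minimal sketch: the two regular expressions are copied from the diff, while the `signatures` helper and the sample strings are illustrative assumptions, not part of docpluck.

```python
import re

# Patterns copied from the diff; everything else here is illustrative.
DIGIT_RUN = re.compile(r"\b\d{6,}\b")
TICK_RUN = re.compile(r"(?:\b\d{1,4}\b[ \t]+){5,}")


def signatures(text: str) -> dict[str, bool]:
    """Report which chart-data signature, if any, matches the text."""
    return {
        "digit_run": DIGIT_RUN.search(text) is not None,
        "tick_run": TICK_RUN.search(text) is not None,
    }


# Hypothetical caption fragments modelled on the changelog examples.
flowchart = "Sample Selection 4876956 Pairs enrolled"
axis_ticks = "hours 0 5 10 15 20 Follow-up time"
stacked = "seven hours 340 321 280 5 270 Sleep duration"
prose = "with n = 1234 participants, p < .001"

for name, sample in [("flowchart", flowchart), ("axis_ticks", axis_ticks),
                     ("stacked", stacked), ("prose", prose)]:
    print(name, signatures(sample))
```

One property of the tick-run pattern as written: every token, including the last, must be followed by a space or tab, so a bare `0 5 10 15 20` at the very end of a string supplies only four qualifying tokens and does not match.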

{docpluck-2.4.2 → docpluck-2.4.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpluck
-Version: 2.4.2
+Version: 2.4.4
 Summary: PDF, DOCX, and HTML text extraction and normalization for academic papers
 Project-URL: Homepage, https://github.com/giladfeldman/docpluck
 Project-URL: Documentation, https://github.com/giladfeldman/docpluck/tree/main/docs

{docpluck-2.4.2 → docpluck-2.4.4}/docpluck/__init__.py RENAMED

@@ -71,7 +71,7 @@ from .figures import Figure
 from .extract_structured import TABLE_EXTRACTION_VERSION, StructuredResult, extract_pdf_structured
 from .render import render_pdf_to_markdown
-__version__ = "2.4.2"
+__version__ = "2.4.4"
 __author__ = "Gilad Feldman"
 __license__ = "MIT"

{docpluck-2.4.2 → docpluck-2.4.4}/docpluck/extract_structured.py RENAMED

@@ -332,11 +332,57 @@ def _extract_caption_text(
     # Re-prefix the label if stripping ate it.
     if cap.label and not snippet.startswith(cap.label):
         snippet = f"{cap.label}. {snippet}".strip()
+    # v2.4.4: trim chart-data appendage from figure captions (axis-tick
+    # sequences, raw bar-chart values pdftotext joined inline into the
+    # caption paragraph). For tables the appendage is usually the next-
+    # row continuation so skip — the caption hard-cap at 400 below
+    # bounds it.
+    if cap.kind == "figure":
+        snippet = _trim_caption_at_chart_data(snippet)
     if len(snippet) > 400:
         snippet = snippet[:400].rsplit(" ", 1)[0] + "…"
     return snippet
+# v2.4.4: shared chart-data trim, duplicated logic from
+# ``docpluck.figures.detect._trim_caption_at_chart_data`` so this module
+# doesn't import from ``figures.detect`` (which has its own layout-channel
+# dependencies). Two signatures of pdftotext-joined chart data:
+#   1. Run of 6+ consecutive digits — flowchart counts, row IDs.
+#   2. Run of 5+ short (1–4 digit) numeric tokens separated only by
+#      whitespace — axis-tick label sequences.
+_CHART_DATA_DIGIT_RUN_RE_STRUCT = re.compile(r"\b\d{6,}\b")
+_CHART_DATA_TICK_RUN_RE_STRUCT = re.compile(r"(?:\b\d{1,4}\b[ \t]+){5,}")
+def _trim_caption_at_chart_data(caption: str) -> str:
+    """Truncate a caption when it transitions from prose to chart-data.
+    Conservative: only fires when caption ≥ 150 chars AND the surviving
+    trimmed text is ≥ 40 chars. The two regex signatures catch
+    complementary chart-data patterns (large counts and small axis-tick
+    sequences); the earlier match wins.
+    """
+    if not caption or len(caption) < 150:
+        return caption
+    candidates: list[int] = []
+    m1 = _CHART_DATA_DIGIT_RUN_RE_STRUCT.search(caption)
+    if m1 is not None:
+        candidates.append(m1.start())
+    m2 = _CHART_DATA_TICK_RUN_RE_STRUCT.search(caption)
+    if m2 is not None:
+        candidates.append(m2.start())
+    if not candidates:
+        return caption
+    cut = min(candidates)
+    while cut > 0 and not caption[cut - 1].isspace():
+        cut -= 1
+    trimmed = caption[:cut].rstrip(" ,;:")
+    if len(trimmed) < 40:
+        return caption
+    return trimmed
 def _isolated_table_from_caption(
     cap: CaptionMatch,
     raw_text: str,
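
The hunk above applies the chart-data trim only to figure captions and then falls through to the existing 400-character cap. A self-contained sketch of that ordering follows. `trim_chart_data` is a condensed restatement of the logic in the diff, and `finalize_caption` with its plain-string `kind` argument is a hypothetical stand-in for the tail of `_extract_caption_text`, which in the package works on a `CaptionMatch` object.

```python
import re

DIGIT_RUN = re.compile(r"\b\d{6,}\b")
TICK_RUN = re.compile(r"(?:\b\d{1,4}\b[ \t]+){5,}")


def trim_chart_data(caption: str) -> str:
    """Condensed restatement of the trim logic shown in the diff."""
    if len(caption) < 150:
        return caption
    matches = (DIGIT_RUN.search(caption), TICK_RUN.search(caption))
    starts = [m.start() for m in matches if m is not None]
    if not starts:
        return caption
    cut = min(starts)  # the earlier signature wins
    while cut > 0 and not caption[cut - 1].isspace():
        cut -= 1
    trimmed = caption[:cut].rstrip(" ,;:")
    return trimmed if len(trimmed) >= 40 else caption


def finalize_caption(snippet: str, kind: str) -> str:
    """Hypothetical helper: trim figures first, then apply the 400-char cap."""
    if kind == "figure":
        snippet = trim_chart_data(snippet)
    if len(snippet) > 400:
        snippet = snippet[:400].rsplit(" ", 1)[0] + "…"
    return snippet


# Illustrative input: real prose followed by repeated flowchart counts.
body = "Flowchart of Study Sample Selection " + "4876956 Pairs enrolled " * 20
raw_figure = "Figure 1. " + body
raw_table = "Table 1. " + body

print(finalize_caption(raw_figure, "figure"))
print(finalize_caption(raw_table, "table"))
```

Under these assumptions the figure caption is reduced to its prose lead-in, while the table caption keeps its appended values and is only bounded by the hard cap, which matches the comment in the hunk about table continuations.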

{docpluck-2.4.2 → docpluck-2.4.4}/docpluck/figures/detect.py RENAMED

@@ -10,6 +10,7 @@ See spec §5.7.
 from __future__ import annotations
+import re
 from collections import defaultdict
 from typing import Any
@@ -135,7 +136,68 @@ def _full_caption_text(raw_text: str, cap: CaptionMatch) -> str:
     end = raw_text.find("\n\n", cap.char_end)
     if end == -1:
         end = min(cap.char_end + 500, len(raw_text))
-    return raw_text[cap.char_start:end].replace("\n", " ").strip()
+    full = raw_text[cap.char_start:end].replace("\n", " ").strip()
+    return _trim_caption_at_chart_data(full)
+# A run of 6+ consecutive digits in a figure caption is almost never
+# legitimate caption prose — page counts, statistical n-values, and years
+# all top out at 5 digits in academic captions. 6+ digits is a strong signal
+# that pdftotext joined chart data (raw bar-chart values, participant counts,
+# row IDs) into the caption.
+_CHART_DATA_DIGIT_RUN_RE = re.compile(r"\b\d{6,}\b")
+# A run of 5+ short numeric tokens (1–4 digits each) separated only by
+# whitespace is a v2.4.4 signal — captures axis-tick label sequences
+# (``0 5 10 15 20``) and stacked column values (``340 321 280 5 270``)
+# that the 6-digit rule misses on charts with small-magnitude data.
+# Real captions reference numbers via prose ("with n = 1234 participants",
+# "p < .001"), so digit tokens are interleaved with words rather than
+# stacked five-in-a-row.
+_CHART_DATA_TICK_RUN_RE = re.compile(r"(?:\b\d{1,4}\b[ \t]+){5,}")
+def _trim_caption_at_chart_data(caption: str) -> str:
+    """Truncate a caption when it transitions from prose to chart-data.
+    pdftotext extracts chart elements (axis labels, legend entries, gridline
+    values) inline with the figure caption when they share a paragraph in the
+    PDF reading order. The resulting caption text looks like::
+        Figure 1. Flowchart of Study Sample Selection 4876956 Pairs enrolled
+        before April 1, 2015 1117269 Pairs excluded 741469 Withdrawal …
+    where the real caption is "Flowchart of Study Sample Selection" and the
+    rest is chart data values.
+    v2.4.4: two complementary signatures are scanned (see module-level
+    constants); the *earlier* match in the caption wins so the caption is
+    trimmed at the start of the chart data, not partway through it.
+    Conservative: only fires when the caption is ≥ 150 chars (real short
+    captions almost never have a chart-data appendage), and only when the
+    surviving trimmed caption is ≥ 40 chars (sanity check protects against
+    edge cases where the digit run lands near the label).
+    """
+    if not caption or len(caption) < 150:
+        return caption
+    candidates: list[int] = []
+    m1 = _CHART_DATA_DIGIT_RUN_RE.search(caption)
+    if m1 is not None:
+        candidates.append(m1.start())
+    m2 = _CHART_DATA_TICK_RUN_RE.search(caption)
+    if m2 is not None:
+        candidates.append(m2.start())
+    if not candidates:
+        return caption
+    cut = min(candidates)
+    # Walk back to the previous word boundary.
+    while cut > 0 and not caption[cut - 1].isspace():
+        cut -= 1
+    trimmed = caption[:cut].rstrip(" ,;:")
+    # Sanity check.
+    if len(trimmed) < 40:
+        return caption
+    return trimmed
 __all__ = ["find_figures"]
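
The "walk back to the previous word boundary" step matters when the digit run is glued to punctuation, because `\b` also matches between a symbol such as `=` and a digit. The sketch below isolates that step. The helper name `cut_at_previous_whitespace` and the sample caption are assumptions made for illustration; only the loop and the `rstrip` call mirror the diff.

```python
import re

DIGIT_RUN = re.compile(r"\b\d{6,}\b")


def cut_at_previous_whitespace(caption: str, match_start: int) -> str:
    """Move the cut left to the nearest whitespace, then drop trailing punctuation."""
    cut = match_start
    while cut > 0 and not caption[cut - 1].isspace():
        cut -= 1
    return caption[:cut].rstrip(" ,;:")


# Hypothetical caption where the count is attached to "(n=".
caption = "Flowchart of sample selection: (n=4876956 pairs enrolled"
m = DIGIT_RUN.search(caption)
assert m is not None

naive = caption[:m.start()]  # cutting at the match start leaves "(n=" behind
walked = cut_at_previous_whitespace(caption, m.start())

print(repr(naive))
print(repr(walked))
```

Cutting at the raw match offset would leave the dangling `(n=` fragment in the caption; walking back to whitespace removes the whole token, and the final `rstrip` clears the colon that preceded it.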

{docpluck-2.4.2 → docpluck-2.4.4}/docpluck/normalize.py RENAMED

@@ -22,7 +22,7 @@ class NormalizationLevel(str, Enum):
     academic = "academic"
-NORMALIZATION_VERSION = "1.8.1"
+NORMALIZATION_VERSION = "1.8.2"
 # ── Request 9 (Scimeto, 2026-04-27): Reference-list normalization ──────────
@@ -1004,8 +1004,31 @@ def normalize_text(
     if repeated:
         lines = [l for l in lines if l.strip() not in repeated]
         t = "\n".join(lines)
-    # Strip standalone page numbers
+    # Strip standalone page numbers — 1-3 digit unconditionally.
     t = re.sub(r"^\s*\d{1,3}\s*$", "", t, flags=re.MULTILINE)
+    # v2.4.3: 4-digit page numbers (continuous-pagination journals like PSPB
+    # where volume runs page numbers into the 1000s). Strip when ALL of:
+    #   1. The line is exactly 4 ASCII digits.
+    #   2. The value falls in the plausible page-number range 1000–9999
+    #      (excludes zero-padded strings such as 0999; years fall inside
+    #      this range and are protected only by rule 3).
+    #   3. The SAME value recurs ≥3 times in the document (page numbers
+    #      repeat once per physical page, so this is conservative; a
+    #      duplicate-by-coincidence table-cell value would need to be the
+    #      same number 3 times, which is rare).
+    # The conservative threshold protects table data where a 4-digit value
+    # might legitimately appear on its own line (single-value-per-line
+    # column layouts).
+    four_digit_counts: dict[str, int] = {}
+    for ln in t.split("\n"):
+        s = ln.strip()
+        if len(s) == 4 and s.isascii() and s.isdigit() and 1000 <= int(s) <= 9999:
+            four_digit_counts[s] = four_digit_counts.get(s, 0) + 1
+    recurring_4d = {s for s, c in four_digit_counts.items() if c >= 3}
+    if recurring_4d:
+        t = "\n".join(
+            "" if ln.strip() in recurring_4d else ln
+            for ln in t.split("\n")
+        )
     report._track("S9_header_footer_removal", before, t, "headers_removed")
     # Limit consecutive newlines
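
The recurrence rule added to the S9 step can be exercised outside `normalize_text`. This is a minimal sketch under stated assumptions: the function name `strip_recurring_4digit_lines` and its `min_count` parameter are hypothetical, and the sample text is invented. The per-line test and the threshold of three follow the hunk above.

```python
def strip_recurring_4digit_lines(text: str, min_count: int = 3) -> str:
    """Blank out standalone 4-digit lines whose value recurs min_count+ times."""
    lines = text.split("\n")
    counts: dict[str, int] = {}
    for ln in lines:
        s = ln.strip()
        # Same per-line test as the diff: exactly four ASCII digits, 1000-9999.
        if len(s) == 4 and s.isascii() and s.isdigit() and 1000 <= int(s) <= 9999:
            counts[s] = counts.get(s, 0) + 1
    recurring = {s for s, c in counts.items() if c >= min_count}
    # Matching lines become empty strings rather than being dropped,
    # so line positions are preserved, as in the diff.
    return "\n".join("" if ln.strip() in recurring else ln for ln in lines)


# Invented sample: 1174 recurs three times, 2024 appears once.
sample = "intro\n1174\nbody\n1174\nmore\n1174\ntable cell\n2024\nend"
print(strip_recurring_4digit_lines(sample))
```

In this sample the three `1174` lines are blanked and the single `2024` line survives. A year that recurs three or more times on its own line would be blanked as well, which the accompanying test in `tests/test_normalization.py` documents as accepted behavior.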

{docpluck-2.4.2 → docpluck-2.4.4}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "docpluck"
-version = "2.4.2"
+version = "2.4.4"
 description = "PDF, DOCX, and HTML text extraction and normalization for academic papers"
 readme = "docs/README.md"
 requires-python = ">=3.10"

docpluck-2.4.4/tests/test_figure_detect.py ADDED

@@ -0,0 +1,220 @@
+"""Figure region detection — caption + bbox metadata only."""
+import json
+import os
+from pathlib import Path
+import pytest
+_HERE = Path(__file__).parent
+_MANIFEST = _HERE / "fixtures" / "structured" / "MANIFEST.json"
+_VIBE = Path(os.path.expanduser("~")) / "Dropbox" / "Vibe"
+def _resolve_fixture(fixture_id: str) -> Path:
+    if not _MANIFEST.is_file():
+        pytest.skip("MANIFEST.json missing")
+    data = json.loads(_MANIFEST.read_text(encoding="utf-8"))
+    base = _VIBE if data.get("vibe_relative") else Path("/")
+    for entry in data["fixtures"]:
+        if entry["id"] == fixture_id:
+            path = base / entry["source_path"]
+            if not path.is_file():
+                pytest.skip(f"Fixture not available: {fixture_id} -> {path}")
+            return path
+    pytest.skip(f"Fixture id not in manifest: {fixture_id}")
+def _layout(fixture_id: str):
+    pdf = _resolve_fixture(fixture_id)
+    from docpluck.extract_layout import extract_pdf_layout
+    return extract_pdf_layout(pdf.read_bytes())
+def test_imports_ok():
+    from docpluck.figures.detect import find_figures
+    assert find_figures is not None
+def test_figure_only_fixture_finds_figures():
+    layout = _layout("nat_comms_figure_only")
+    from docpluck.figures.detect import find_figures
+    figures = find_figures(layout)
+    if not figures:
+        pytest.skip("no figures detected on this fixture")
+    for f in figures:
+        assert f["label"] is not None and f["label"].startswith("Figure ")
+        assert f["caption"] is not None and len(f["caption"]) > 0
+        x0, top, x1, bottom = f["bbox"]
+        assert x1 > x0
+        assert bottom >= top  # allow degenerate but not negative
+def test_no_figures_returns_empty_or_only_real_figures():
+    """A negative-case fixture should yield zero or only well-formed figures."""
+    # Use any fixture with expected_figures==0; if not available, skip.
+    manifest_data = json.loads(_MANIFEST.read_text(encoding="utf-8"))
+    fixture_id = None
+    for e in manifest_data["fixtures"]:
+        if e.get("expected_figures") == 0:
+            fixture_id = e["id"]
+            break
+    if fixture_id is None:
+        pytest.skip("no expected_figures=0 fixture in manifest")
+    layout = _layout(fixture_id)
+    from docpluck.figures.detect import find_figures
+    figures = find_figures(layout)
+    # If any figures show up, they should at least have valid shape.
+    for f in figures:
+        assert f["label"] is None or f["label"].startswith("Figure ")
+        x0, top, x1, bottom = f["bbox"]
+        assert x1 > x0
+def test_figure_id_is_unique_and_sequential():
+    layout = _layout("nat_comms_figure_only")
+    from docpluck.figures.detect import find_figures
+    figures = find_figures(layout)
+    if not figures:
+        pytest.skip("no figures detected")
+    ids = [f["id"] for f in figures]
+    assert len(set(ids)) == len(ids)
+    assert all(fid.startswith("f") for fid in ids)
+    # Sequential 1..n
+    expected = [f"f{i}" for i in range(1, len(figures) + 1)]
+    assert ids == expected
+def test_figure_typeddict_shape():
+    from docpluck.figures import Figure
+    f: Figure = {
+        "id": "f1", "label": "Figure 1", "page": 3,
+        "bbox": (72.0, 100.0, 540.0, 320.0),
+        "caption": "Mean reaction time across conditions.",
+    }
+    assert f["id"] == "f1"
+# v2.4.3: caption truncation at chart-data boundary
+# (digit runs ≥ 6 chars indicate pdftotext joined raw chart values into the
+# caption paragraph — common in clinical / biological flowcharts).
+def test_trim_caption_at_chart_data_truncates_long_digit_run():
+    from docpluck.figures.detect import _trim_caption_at_chart_data
+    cap = (
+        "Figure 1. Flowchart of Study Sample Selection 4876956 Pairs enrolled "
+        "before April 1, 2015 1117269 Pairs excluded 741469 Withdrawal 148414 "
+        "Withdrawal after baseline 137787 With spouses onset of CVD 84585 "
+        "With onset of depression 5014 Duplicated couples 3792142 Eligible "
+        "pairs Matched by age and income"
+    )
+    out = _trim_caption_at_chart_data(cap)
+    # The 7-digit run "4876956" meets the 6+ digit threshold, so truncation lands just before it.
+    assert out == "Figure 1. Flowchart of Study Sample Selection"
+    assert "4876956" not in out
+def test_trim_caption_preserves_short_caption():
+    from docpluck.figures.detect import _trim_caption_at_chart_data
+    cap = "Figure 2. A short caption with a year reference 2020 here."
+    out = _trim_caption_at_chart_data(cap)
+    # Under 150-char threshold AND no 6-digit run; no-op.
+    assert out == cap
+def test_trim_caption_preserves_legitimate_5digit_numbers():
+    from docpluck.figures.detect import _trim_caption_at_chart_data
+    cap = (
+        "Figure 3. Sample selection diagram including all participants from "
+        "the original cohort (N = 12345) and the analytic subsample of 9876 "
+        "individuals who completed both waves of the longitudinal survey "
+        "between 2018 and 2024 with no missing data on the focal outcomes."
+    )
+    out = _trim_caption_at_chart_data(cap)
+    # 5-digit "12345" does NOT trigger; whole caption preserved.
+    assert out == cap
+def test_trim_caption_preserves_prose_with_no_digits():
+    from docpluck.figures.detect import _trim_caption_at_chart_data
+    cap = (
+        "Figure 4. Cumulative incidence of depression by spouses cardiovascular "
+        "event among the entire study sample. The horizontal axis shows the "
+        "time in months and the vertical axis is cumulative incidence of "
+        "depression in percent. Lines represent the four sex-age subgroups."
+    )
+    out = _trim_caption_at_chart_data(cap)
+    # No 6-digit run; full caption preserved.
+    assert out == cap
+def test_trim_caption_keeps_minimum_post_label_content():
+    from docpluck.figures.detect import _trim_caption_at_chart_data
+    # 6-digit run lands right after the label — truncation would leave
+    # just "Figure 1." (under 40-char sanity check) — return original.
+    long_cap = "Figure 5. " + "x" * 200 + " 1234567 stuff"  # >150 chars
+    short_pre_label = "Figure 5. 1234567 chart data " + "y" * 200
+    out = _trim_caption_at_chart_data(short_pre_label)
+    # Sanity check fires; return original.
+    assert out == short_pre_label
+# v2.4.4: caption truncation extended to short-token tick runs (5+ short
+# numeric tokens in a row — axis-tick label sequences from charts).
+def test_trim_caption_at_tick_run_truncates_axis_labels():
+    """v2.4.4: detect chart axis-tick sequences (5+ short numeric tokens
+    separated only by whitespace) — jama_open_3-style Kaplan-Meier
+    captions absorb gridline values like ``0 0 5 10 15`` that the 6-digit
+    rule didn't catch."""
+    from docpluck.figures.detect import _trim_caption_at_chart_data
+    cap = (
+        "Figure 1. Unadjusted Kaplan-Meier Curves Across Groups With "
+        "Different Objective Sleep Duration for All-Cause Mortality 100 "
+        "90 Survival probability, % 80 70 Sleep duration 60 seven hours "
+        "6 to 7 hours 50 5 to 6 hours less than 5 hours 0 0 5 10 15 "
+        "Follow-up time y No at risk Sleep duration seven hours 340 321 "
+        "280 5 Sleep duration"
+    )
+    out = _trim_caption_at_chart_data(cap)
+    assert "0 0 5 10 15" not in out
+    # Trim should preserve the prose lead-in.
+    assert out.startswith("Figure 1. Unadjusted Kaplan-Meier Curves")
+def test_trim_caption_preserves_legitimate_prose_with_inline_numbers():
+    """Real caption prose references numbers in stats ('n = 1234', 'p < .001'),
+    but each number is followed by a word — not 5+ stacked numerics in a row."""
+    from docpluck.figures.detect import _trim_caption_at_chart_data
+    cap = (
+        "Figure 2. Mean reaction times across the four experimental "
+        "conditions, with n = 1234 participants total (95% CI [120.5, "
+        "180.3] ms for condition A; 95% CI [110.2, 175.4] for condition "
+        "B). Significant differences observed at p < .001 between paired "
+        "conditions in all 4 contrasts of interest, as predicted."
+    )
+    out = _trim_caption_at_chart_data(cap)
+    assert out == cap
+def test_trim_caption_picks_earliest_match_across_both_rules():
+    """When both the 6-digit-run and the 5-token-tick rules match,
+    truncate at the earlier offset so we don't keep chart data past the
+    first signal."""
+    from docpluck.figures.detect import _trim_caption_at_chart_data
+    # Tick run appears first; 6-digit run appears later.
+    cap = (
+        "Figure 3. Bar plot of conditions A through F across the years "
+        "of interest 2020 2021 2022 2023 2024 2025 with later analytic "
+        "subsample participant total 4876956 in the secondary cohort "
+        "described in the methods section above and detailed in the "
+        "supplementary materials accompanying this paper."
+    )
+    out = _trim_caption_at_chart_data(cap)
+    # The tick run "2020 2021 2022 2023 2024 2025" appears earlier; trim
+    # there.
+    assert "2020 2021" not in out
+    assert "4876956" not in out

{docpluck-2.4.2 → docpluck-2.4.4}/tests/test_normalization.py RENAMED

@@ -414,6 +414,49 @@ class TestS9_HeaderFooter:
         result = norm(text, "standard")
         assert "\n42\n" not in result
+    def test_4digit_page_numbers_stripped_when_recurring(self):
+        """v2.4.3: Continuous-pagination journals (PSPB, JESP volume runs)
+        emit page numbers in the 1000-9999 range. When the same 4-digit
+        value appears on its own line 3+ times in the doc, treat it as
+        a page-number artifact and strip."""
+        text = (
+            "First page content here.\n"
+            "1174\n"
+            "Second page begins.\n"
+            "1175\n"
+            "Body sentence continues.\n"
+            "1174\n"
+            "More body.\n"
+            "1175\n"
+            "Even more body content.\n"
+            "1174\n"
+        )
+        result = norm(text, "standard")
+        # 1174 appears 3 times → stripped.
+        assert "\n1174\n" not in result
+        # 1175 appears 2 times → not yet meeting the ≥3 threshold,
+        # so left alone (conservative).
+        assert "1175" in result
+    def test_4digit_year_on_own_line_preserved(self):
+        """A 4-digit value that only appears ONCE on its own line is NOT
+        a page number — could be a year reference or stray data. Leave it."""
+        text = "body text\n2024\nmore body text\n"
+        result = norm(text, "standard")
+        assert "2024" in result
+    def test_4digit_below_1000_preserved(self):
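+        """Despite the test name, this documents the recurrence rule on a
+        year-like value: a standalone 2020 that recurs 3+ times falls inside
+        the 1000-9999 range and is stripped."""
+        # Values below 1000 are handled by the 1-3-digit pattern; zero-padded
+        # strings such as 0999 fall outside the range and wouldn't naturally occur.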
+        text = "abc\n2020\ndef\n2020\nxyz\n2020\nfinal\n"
+        result = norm(text, "standard")
+        # 2020 recurs 3+ but is a year; the heuristic ALSO strips this
+        # case (1000-9999 range), which is acceptable since
+        # standalone-line years are a rare verbatim pattern in academic
+        # prose. Document the behavior here.
+        assert "2020" not in result
     def test_short_lines_preserved(self):
         """Lines < 15 chars should NOT be treated as headers."""
         text = "Short\n" * 10 + "Content"

docpluck-2.4.2/tests/test_figure_detect.py DELETED

@@ -1,96 +0,0 @@
-"""Figure region detection — caption + bbox metadata only."""
-import json
-import os
-from pathlib import Path
-import pytest
-_HERE = Path(__file__).parent
-_MANIFEST = _HERE / "fixtures" / "structured" / "MANIFEST.json"
-_VIBE = Path(os.path.expanduser("~")) / "Dropbox" / "Vibe"
-def _resolve_fixture(fixture_id: str) -> Path:
-    if not _MANIFEST.is_file():
-        pytest.skip("MANIFEST.json missing")
-    data = json.loads(_MANIFEST.read_text(encoding="utf-8"))
-    base = _VIBE if data.get("vibe_relative") else Path("/")
-    for entry in data["fixtures"]:
-        if entry["id"] == fixture_id:
-            path = base / entry["source_path"]
-            if not path.is_file():
-                pytest.skip(f"Fixture not available: {fixture_id} -> {path}")
-            return path
-    pytest.skip(f"Fixture id not in manifest: {fixture_id}")
-def _layout(fixture_id: str):
-    pdf = _resolve_fixture(fixture_id)
-    from docpluck.extract_layout import extract_pdf_layout
-    return extract_pdf_layout(pdf.read_bytes())
-def test_imports_ok():
-    from docpluck.figures.detect import find_figures
-    assert find_figures is not None
-def test_figure_only_fixture_finds_figures():
-    layout = _layout("nat_comms_figure_only")
-    from docpluck.figures.detect import find_figures
-    figures = find_figures(layout)
-    if not figures:
-        pytest.skip("no figures detected on this fixture")
-    for f in figures:
-        assert f["label"] is not None and f["label"].startswith("Figure ")
-        assert f["caption"] is not None and len(f["caption"]) > 0
-        x0, top, x1, bottom = f["bbox"]
-        assert x1 > x0
-        assert bottom >= top  # allow degenerate but not negative
-def test_no_figures_returns_empty_or_only_real_figures():
-    """A negative-case fixture should yield zero or only well-formed figures."""
-    # Use any fixture with expected_figures==0; if not available, skip.
-    manifest_data = json.loads(_MANIFEST.read_text(encoding="utf-8"))
-    fixture_id = None
-    for e in manifest_data["fixtures"]:
-        if e.get("expected_figures") == 0:
-            fixture_id = e["id"]
-            break
-    if fixture_id is None:
-        pytest.skip("no expected_figures=0 fixture in manifest")
-    layout = _layout(fixture_id)
-    from docpluck.figures.detect import find_figures
-    figures = find_figures(layout)
-    # If any figures show up, they should at least have valid shape.
-    for f in figures:
-        assert f["label"] is None or f["label"].startswith("Figure ")
-        x0, top, x1, bottom = f["bbox"]
-        assert x1 > x0
-def test_figure_id_is_unique_and_sequential():
-    layout = _layout("nat_comms_figure_only")
-    from docpluck.figures.detect import find_figures
-    figures = find_figures(layout)
-    if not figures:
-        pytest.skip("no figures detected")
-    ids = [f["id"] for f in figures]
-    assert len(set(ids)) == len(ids)
-    assert all(fid.startswith("f") for fid in ids)
-    # Sequential 1..n
-    expected = [f"f{i}" for i in range(1, len(figures) + 1)]
-    assert ids == expected
-def test_figure_typeddict_shape():
-    from docpluck.figures import Figure
-    f: Figure = {
-        "id": "f1", "label": "Figure 1", "page": 3,
-        "bbox": (72.0, 100.0, 540.0, 320.0),
-        "caption": "Mean reaction time across conditions.",
-    }
-    assert f["id"] == "f1"