PyPI - docpluck - Versions diffs - 2.4.90__tar.gz → 2.4.91__tar.gz - Mend

docpluck 2.4.90 (tar.gz) → 2.4.91 (tar.gz)


Files changed (413)

{docpluck-2.4.90 → docpluck-2.4.91}/.claude/skills/_project/lessons.md RENAMED

@@ -576,6 +576,12 @@ Rotation picks `pool[(N mod L) : (N mod L) + rotation_size]` wrapping. Over `cei
 **How to diagnose:** 3-way check — pdftotext vs pdfplumber vs a visual/AI read of the PDF. When both deterministic extractors agree on the wrong value and only the visual disagrees, the codepoint is baked wrong and NO text-channel logic can fix it (recovery needs OCR / multimodal-glyph-consensus, a new subsystem). User decision 2026-06-08: **document as a known limitation, do not scope an OCR subsystem.** Consumer guidance: downstream stat-checkers (CitationGuard) must cross-verify digits against CrossRef/visual — docpluck cannot guarantee a digit matches the visual glyph when the publisher baked the wrong codepoint.
+## RC-1 Step 2 — per-band column re-extraction; the word-preservation guard is the safety (2026-06-15, v2.4.90)
+**What:** the dominant two-column-interleave defect (Method/Results/Discussion scrambled) on table-bearing pages that Step 1 (`extract_page_text_columns`, whole-page) cannot reach — its bilateral y-row gate + full-height gutter strip reject any page with a full-width band crossing the centre (confirmed: `DOCPLUCK_COLUMN_CORRECT_GENERAL=1` was byte-near-identical on the failing papers). **Step 2** (`extract_page_text_banded`, `docpluck/extract_columns.py`) segments a page into horizontal y-bands, column-corrects prose bands, keeps full-width (table/banner/title) bands intact; applied as a **fallback inside `splice_column_corrected_pages`** only when whole-page returns "", under the SAME word-preservation guard. Ship-dark behind `DOCPLUCK_COLUMN_CORRECT_BANDED` (default OFF → flag-OFF byte-identical, 26/26 baseline unchanged). AI-verified ON_BETTER on chan_feldman + chandrashekar (0 text-loss/halluc/regression).
+**The load-bearing lesson:** the word-preservation guard (substantial-word multiset of the re-extraction == original page) makes ANY segmentation heuristic SAFE — a bad reorder is rejected, the page kept as-is — so optimize segmentation for COVERAGE, not for never-being-wrong. Validate with a corpus word-multiset scan (flag-OFF vs flag-ON whole-doc multiset MUST be identical, `lost=0 gained=0`) BEFORE AI-verify. Three hazards the guard caught: (1) full-width title lines column-split mid-word → a row is 2-col only if the strip `[gx±4pt]` is glyph-free, not merely "no word spans gx"; (2) band cuts bisecting tall title glyphs → merge vertically-overlapping bands to full-width; (3) per-row both-sides is conservative but halves guard-rejections vs gutter-clear-only on hard pages (6 vs 12 of 71) — keep it, layer banded as a fallback so clean 2-col pages use the proven whole-page path. Remaining before default-flip: band-cut clips (6/71, guard-rejected), title+sidebar pages — see `docs/superpowers/specs/2026-06-08-rc1-region-aware-column-architecture.md` "Step 2 — remaining work". Shared card: `band-reextraction-lean-on-word-preservation-guard`.
 ## Dropped-glyph recovery splits into layout-recoverable vs pixel-only — probe per-instance before designing (2026-06-15, v2.4.89 W0h)
 **What:** A glyph that **pdftotext drops entirely** (emits nothing) is NOT one class but two, and a per-*instance* 3-way diff tells them apart:
@@ -585,3 +591,27 @@ Rotation picks `pool[(N mod L) : (N mod L) + rotation_size]` wrapping. Over `cei
 **Trap:** "the layout channel recovers what pdftotext drops" is true only for sub-case 1. Probe the SPECIFIC failing instances (geometry: is there a char/line/rect immediately left of the number?) before assuming a layout fix is complete — feasibility here was 3/4, and a recovery that silently fixes 3 of 4 sign-flips is a product call (false-confidence), surfaced to the user.
 **Plumbing gotcha (cost a detour):** the section/render path calls `normalize_text` WITHOUT `layout=` (`sections/__init__.py`), so F0 and every layout-gated pass is OFF there by design (text-channel-only contract). A layout-aware fix must thread a **dedicated** param (`dropped_minus_layout`, `render → extract_sections → normalize_text`) — reusing the `layout=` gate would also switch F0 on and risk broad regressions. The detector must cluster chars into lines by **y-overlap**, never `round(top)` (a minus sits ~0.4pt off its digits' baseline and rounds into a different bucket, orphaning it).
+## On resume, `git status` BEFORE editing source — a concurrent session can co-edit the same files (2026-06-16)
+**What:** A `/docpluck-iterate` resume opened on the RC-1 Step 2 handoff. The system-prompt git snapshot said "(clean)", so I went straight to implementing Step 2 (`extract_page_text_bands` + helpers in `extract_columns.py`). But a SECOND Claude session was concurrently implementing the SAME feature (`extract_page_text_banded` + splice/extract.py wiring + `DOCPLUCK_COLUMN_CORRECT_BANDED` flag). Two `Edit`/`Write` "File has been modified since read" events fired on files I hadn't changed; `extract.py` got an mtime I never wrote; 18 `claude.exe` procs were live. The other session committed first (`git add` swept MY uncommitted duplicate into the **v2.4.90 release commit `1325d14`** → orphaned dead code shipped to the tag + prod, inert/uncalled but wrong). Resolution: confirmed the duplicate was uncalled (distinct names, no shadowing), removed my whole block, 69 column-path tests green, committed the removal.
+**Lessons:** (1) **The system-prompt git snapshot is from conversation START and goes stale within the session** — on any resume, run a fresh `git status` + `git log --oneline -5` BEFORE the first source edit. (2) **"File has been modified since read" on a file YOU didn't touch = STOP and check for a concurrent editor** (`git diff HEAD`, file mtimes, `tasklist | grep claude`), don't just re-read-and-retry. (3) When two agents share a working tree, whoever commits first sweeps the other's uncommitted changes into THEIR commit — so a concurrent edit is not just a merge risk, it can silently ship your half-done work in someone else's release. (4) When you discover the collision, **surface it to the user** (they know if two sessions are intentional) rather than racing or unilaterally reverting (reverting is itself a write that can clobber the other session's in-flight work).
+## Text-channel heading promotion can't use line-WIDTH to gate — earlier render steps join wrapped lines (2026-06-17)
+**What:** Resuming docpluck-iterate on the cycle-2 canary FAILs, user chose "headings first." Root-caused the JESP/Elsevier single-column subsection demotion: labels like "Overview", "Practice instructions", "Self-control assessment" are emitted by pdftotext on their own line with NO blank padding on EITHER side (glued between the prior subsection's body and their own body), so every existing promoter — all of which require `blank_before AND blank_after` — skips them. A minimal relaxation of the `blank_before` gate in `_promote_isolated_titlecase_subsection_headings` (admit no-blank-before when the prior line is a sentence-terminated PROSE line, i.e. a clean paragraph boundary, not a mid-sentence column-wrap) **fixes ar_apa perfectly** (+`### Overview` / `### Practice instructions` / `### Self-control assessment`, 0 removed). BUT it over-promotes **5 two-column table-cell / measures-list labels on ip_feldman** (`### Others ratings`, `### Address order effects`, `### Prevalence Estimation Error: …`) — the G5d hallucinated-heading blocker. Reverted; no release.
+**The trap (durable):** The obvious discriminator — "real single-column heading is followed by full-page-width body (~60-90 char lines); a narrow two-column table cell is surrounded by ~30-char lines" — **WORKS on RAW pdftotext text but is USELESS at the promoter**, because earlier render-pipeline steps JOIN wrapped lines into long paragraphs before `_promote_isolated_titlecase_subsection_headings` runs. Measured: "Others ratings" body max line width is **36 chars raw → 112 chars in-pipeline**; "Address order effects" 32 → 80. A body-width gate computed inside the promoter therefore admits everything and does nothing.
+**Lessons:** (1) Any heading promote/demote heuristic that needs LINE-WIDTH or LINE-WRAP structure must be computed from the **raw pdftotext output (before line-joining)** and threaded into the promoter as a precomputed signal (e.g. a document-level `is_single_column` flag from median raw body-line width ≥ ~62, or per-region width), NOT recomputed inside the render-pipeline promoter where the signal is already destroyed. (2) The safe general fix for this defect class is to **scope the no-blank-padding relaxation to single-column documents** (computed from raw text or the layout channel) — two-column subsection headings are already handled by the blank-isolation / chain paths, so single-column-gating both fixes JESP/Elsevier AND prevents the two-column table-cell over-promotion. (3) Always verify a promotion change against the **G5d canary (ip_feldman) with a deterministic heading-count delta** before trusting it — `diff <(grep -E '^#{1,4} ' before) <(grep -E '^#{1,4} ' after)` instantly shows added/removed headings per paper; ar_apa gaining exactly 3 and ip_feldman gaining 5 was the whole story in one command.
+## Single-column gate via raw-text wide-line fraction RESOLVES the JESP glued-heading blocker — and AI-verify catches a new single-column false promotion (2026-06-17 v2.4.91)
+**What:** Implemented the precisely-scoped next step from the (same-day) blocked entry above. `_raw_text_is_single_column(raw_text)` = fraction of non-blank RAW-pdftotext lines wider than 65 chars ≥ 0.25, threaded into `_promote_isolated_titlecase_subsection_headings(text, *, is_single_column)`; the hard `blank_before` reject is relaxed ONLY when single-column AND `_prev_paragraph_is_sentence_terminated`. Result: `ar_apa_j_jesp_2009_12_011` gains exactly `### Overview` / `### Practice instructions` / `### Self-control assessment` (AI-verified vs the article-finder reading gold: `new_headings_are_real=true`), two-column `ip_feldman_2025_pspb` byte-identical (G5d trap avoided), 26-baseline 26/26.
+**Why `frac>65` and NOT median or interleave-pages:** The corpus measurement (all 26 baseline papers) showed median misclassifies single-column table-heavy papers (plos_med median 48 but genuinely single-column), and the column-interleave page count is useless — the genuine single-column target ar_apa has 4/5 pages flagged by `_detect_column_interleave_pages` (short reference/table lines trip it). `frac>65` is the physical invariant: a two-column layout *cannot* emit many >65-char lines (each column wraps at ~30-48), so it stays 0.06–0.24; single-column body prose wraps full-width → 0.28–0.58. The gap (0.235→0.280) is clean and corpus-wide.
+**The AI-verify catch (why Phase-5d is non-negotiable):** the single-column relaxation ALSO promoted `### Anesthesiologists; CI, confidence interval; DSMB,` on plos_med_1 — an abbreviation-glossary line, a NEW hallucinated heading (absent at HEAD, confirmed by a stash+render heading-delta). A deterministic heading-delta scan of the 15 single-column papers had NOT flagged it as wrong (it was a real heading-count delta); only the Sonnet AI-verify against the gold identified it as a HALLUCINATION. Fix: extend `_is_single_col_relaxation_fragment` to reject any candidate containing an internal `;` or ending in `,` (a heading never does; an abbreviation/clause list always does). After the guard, plos_med net +0, all legitimate headings retained.
+**How to detect next time:** (1) For any layout-dependent render heuristic, measure the discriminator across the WHOLE baseline corpus before fixing the threshold — don't tune to make one paper pass (median would have). (2) A heading-count *delta* is necessary but NOT sufficient evidence a promotion is correct — a +1 delta can be a hallucination; the AI-gold verify is what distinguishes a real subsection from a promoted glossary/table-cell line. Always AI-verify every single-column paper whose heading count changes, not just the target. (3) Heading false-positive shapes recur per call-site: bracket furniture, leading-preposition wraps, dangling-connector tails, and `;`/trailing-`,` list fragments are the standing reject set for any *relaxed* promoter.
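
The single-column gate and the `;` / trailing-`,` fragment guard described in the lessons above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function names, the choice to measure width after right-stripping each line, and the treatment of any semicolon as disqualifying are assumptions. Only the 65-character width and the 0.25 fraction come from the text.

```python
def wide_line_fraction(raw_text: str, width: int = 65) -> float:
    """Fraction of non-blank lines wider than `width` characters."""
    # Assumption: width is measured on the right-stripped line.
    lines = [ln.rstrip() for ln in raw_text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    wide = sum(1 for ln in lines if len(ln) > width)
    return wide / len(lines)


def raw_text_is_single_column(
    raw_text: str, width: int = 65, min_fraction: float = 0.25
) -> bool:
    """Treat a document as single-column when enough raw lines are wide."""
    return wide_line_fraction(raw_text, width) >= min_fraction


def looks_like_list_fragment(candidate: str) -> bool:
    """Reject glossary or clause-list lines as heading candidates.

    Simplification: any semicolon counts, not only an internal one.
    """
    stripped = candidate.strip()
    return ";" in stripped or stripped.endswith(",")
```

The signal has to be computed on the raw extractor output, because the lessons note that later pipeline steps join wrapped lines and erase the width difference between layouts.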

{docpluck-2.4.90 → docpluck-2.4.91}/.claude/skills/docpluck-iterate/LEARNINGS.md RENAMED

@@ -1198,3 +1198,17 @@ aren't skipped.
 ### Process notes
 - One AI-verify cycle surfaced THREE pre-existing defect classes beyond the target: RC-1 two-column interleave (ip_feldman/chandrashekar/chan_feldman/ar_apa-table), B1 table-completeness (plos Tables 2/3/4/5 lose rows/cols/bodies), and metadata-leak (plos affiliations/abbrev/running-headers — an RC-2 residual). Per 0e-bis the run's standing verdict stays FAIL; cycle 1 is an incremental ship, not a clean PASS. Surfaced to user as the run punch-list.
+---
+## Run: 2026-06-15 · cycle 2 (resume) · RC-1 Step 2 shipped v2.4.90 (ship-dark)
+### Outcome
+- Resumed the open run (B7 v2.4.89 already committed by a concurrent session-instance; independently re-verified it: test 5/5, suite 547 passed, baseline 26/26 — no work lost). Implemented **RC-1 Step 2** (`extract_page_text_banded`): per-band region-aware two-column re-extraction, the architectural fix for THE dominant defect. Ship-dark behind `DOCPLUCK_COLUMN_CORRECT_BANDED` (default OFF; flag-OFF byte-identical, 26/26 baseline unchanged). AI-verified ON_BETTER on chan_feldman + chandrashekar vs article-finder golds (0 text-loss/halluc/regression). Committed `1325d14` + local tag v2.4.90, deploy HELD.
+### Blind spots / learnings
+- **The word-preservation guard is the load-bearing safety, not the heuristic.** It rejects any non-pure-reorder, so band segmentation can be optimized for COVERAGE not correctness. Corpus word-multiset scan (flag-OFF vs flag-ON identical) is the fast pre-AI-verify gate. Three hazards it caught: full-width-title column-split (fix: gutter-strip-clear row test, not "no word spans gx"), band-cut glyph bisection (fix: merge overlapping bands to full-width), per-row-both-sides vs gutter-clear coverage/rejection trade. Durable: shared card `band-reextraction-lean-on-word-preservation-guard` + project lessons.md.
+- **Concurrency hazard observed:** a second session-instance committed B7 to the SAME working tree mid-run (foreign pytest procs + a commit at 14:59 UTC during my session). Reconciled via git log + run-meta (it had run its postflight + paused PARTIAL-PAUSED). Lesson: on resume, `git log origin/main..HEAD` + run-meta `completed_at`/`run_closeout` BEFORE assuming the working tree is yours; a clean `git add` that stages nothing means someone already committed it.
+### Process notes
+- cycle-2 iterate-gate = FAIL (I2 partial canary coverage: only chan+chandra AI-verified of the canary/target set; I3 tables remain; I10 rendered_sha is the flag-ON render not the gate's flag-OFF canary artifact). Honest incremental-cycle FAIL per 0e-bis — run stays OPEN/PARTIAL, NOT closed. Remaining punch-list in the handoff + spec "Step 2 — remaining work".
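
The word-preservation guard that the learnings above lean on can be illustrated with a short sketch. It assumes a "substantial word" is a lowercase run of at least three letters; the package's own `_word_multiset` tokenization is not shown in this diff, so the names and the tokenizer below are hypothetical.

```python
import re
from collections import Counter

# Assumption: letters-only tokens, case-folded, minimum length 3.
_LETTERS = re.compile(r"[^\W\d_]+")


def word_multiset(text: str, min_len: int = 3) -> Counter:
    """Multiset of substantial words in `text`."""
    tokens = _LETTERS.findall(text.lower())
    return Counter(t for t in tokens if len(t) >= min_len)


def is_pure_reorder(original: str, candidate: str) -> bool:
    """True when the candidate holds exactly the same words as the original."""
    return word_multiset(original) == word_multiset(candidate)


def accept_or_keep(original: str, candidate: str) -> str:
    """Accept a re-extraction only if it is a pure reorder; otherwise keep the page."""
    if candidate and is_pure_reorder(original, candidate):
        return candidate
    return original
```

Because a rejected candidate simply leaves the page unchanged, a segmentation heuristic behind such a guard can be tuned for coverage, which is the point the entry makes.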

{docpluck-2.4.90 → docpluck-2.4.91}/.github/workflows/test.yml RENAMED

@@ -36,3 +36,6 @@ jobs:
       - name: Run tests
         run: pytest tests/ -v --tb=short
+      - name: Check docs and metadata consistency
+        run: python scripts/check_docs_consistency.py

{docpluck-2.4.90 → docpluck-2.4.91}/CHANGELOG.md RENAMED

@@ -1,5 +1,15 @@
 # Changelog
+## [2.4.91] — 2026-06-17
+**Single-column subsection-heading promotion — recover glued JESP/Elsevier subsection headings without re-opening the two-column G5d trap.** Render-layer only; no `NORMALIZATION_VERSION` / `SECTIONING_VERSION` change.
+Surfaced by `/docpluck-iterate` Phase-5d (canary `ar_apa_j_jesp_2009_12_011`). Single-column Elsevier/JESP papers emit subsection headings ("Overview", "Practice instructions", "Self-control assessment") on their own line with **no blank padding on either side** — glued directly between the prior subsection's sentence-terminated body and their own body. Every existing promoter requires `blank_before AND blank_after` (or the PSPB no-blank-after relaxation, which still requires `blank_before`), so these stayed demoted to body text.
+`_promote_isolated_titlecase_subsection_headings` now admits a no-blank-before candidate **only when the document is single-column** AND the immediately-preceding line is a sentence-terminated prose line. The single-column signal (`_raw_text_is_single_column`) is the fraction of raw-pdftotext non-blank lines wider than 65 chars (≥ 0.25) — a structural typographic invariant computed from the **raw** text *before* the render pipeline joins column-wrapped lines and destroys the line-width signal (corpus separation: two-column 0.06–0.24, single-column 0.28–0.58; threshold sits in the natural gap). Two-column layouts keep the hard `blank_before` reject, where the identical shape is a narrow table-cell / measures-list label (the G5d hallucinated-heading trap). A fragment guard (`_is_single_col_relaxation_fragment`) additionally rejects bracket furniture, leading-preposition sentence-wraps, dangling-connector tails, and **abbreviation-glossary / clause lists** (internal `;` or trailing `,`).
+**Validation (2026-06-17).** Deterministic heading-delta across the corpus: `ar_apa_j_jesp_2009_12_011` **+3** genuine headings (AI-verified vs the article-finder reading gold: `new_headings_are_real=true`, zero hallucination), `ar_apa_…_010` +9, `jmf_1` +11, `demography_1` +4, `bjps_1` +3, `chen_2021_jesp` +3; two-column papers (`ip_feldman_2025_pspb` byte-identical, `chan_feldman`, `chandrashekar`) **+0**; `plos_med_1` net +0 (one abbreviation-glossary false promotion caught by the fragment guard). 26-paper baseline 26/26 (one transient pdftotext timeout, PASS in isolation); full suite 1733 passed. New regression `tests/test_single_column_subsection_promote_real_pdf.py` (23 cases). The canary corpus still FAILs on **pre-existing** table row/column loss and RC-1 column-interleave (separate root causes, queued) — this release is an incremental heading-fidelity improvement, not a corpus-clean claim.
 ## [2.4.90] — 2026-06-15
 **RC-1 Step 2 — per-band region-aware two-column re-extraction (ship-dark behind `DOCPLUCK_COLUMN_CORRECT_BANDED`, default OFF).** No `NORMALIZATION_VERSION` change — the default path is byte-identical; the flag only adds reading-order corrections upstream of normalize.
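
The deterministic heading-delta check cited in the 2.4.91 validation notes can be approximated in Python. This is a rough equivalent of the `grep -E '^#{1,4} '` diff mentioned in the project lessons, not the project's own tooling, and the function name is hypothetical.

```python
import re
from collections import Counter

# Same pattern as the grep in the lessons: one to four '#' followed by a space.
HEADING_RE = re.compile(r"^#{1,4} .*$", re.MULTILINE)


def heading_delta(before_md: str, after_md: str) -> tuple[list[str], list[str]]:
    """Return (added, removed) headings between two markdown renders."""
    before = Counter(h.rstrip() for h in HEADING_RE.findall(before_md))
    after = Counter(h.rstrip() for h in HEADING_RE.findall(after_md))
    added = sorted((after - before).elements())
    removed = sorted((before - after).elements())
    return added, removed
```

As the lessons stress, a delta only shows what changed. Whether an added heading is genuine still needs a check against a gold reading.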

{docpluck-2.4.90 → docpluck-2.4.91}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpluck
-Version: 2.4.90
+Version: 2.4.91
 Summary: PDF, DOCX, and HTML text extraction and normalization for academic papers
 Project-URL: Homepage, https://docpluck.app
 Project-URL: Documentation, https://docpluck.app/api-docs
@@ -57,7 +57,7 @@ Supports three input formats:
 - **DOCX** via `mammoth` (DOCX → HTML → text, preserving Shift+Enter soft breaks)
 - **HTML** via `beautifulsoup4` + `lxml` (block/inline-aware tree-walk)
-All three formats feed into the same 15-step normalization pipeline and quality scoring.
+All three formats feed into the same normalization pipeline and quality scoring.
 ---
@@ -298,8 +298,8 @@ Apply the normalization pipeline at the specified level.
 | Level | Steps | Use when |
 |-------|-------|----------|
 | `none` | — | You want raw text, no modifications |
-| `standard` | S0-S9 | General text processing (NLP, search indexing) |
-| `academic` | S0-S9 + A1-A6 | Statistical pattern matching, meta-analysis |
+| `standard` | Core cleanup (`S*`) + document-shape cleanup (`F0/H0/T0/P0/P1/W0`) + recovery/ref joins (`R2/R3/A7`) | General text processing (NLP, search indexing) |
+| `academic` | `standard` + statistical repairs (`A*` and `W0*`) | Statistical pattern matching, meta-analysis |
 ```python
 from docpluck import normalize_text, NormalizationLevel
@@ -313,7 +313,7 @@ text, report = normalize_text(raw, NormalizationLevel.standard)
 # Full statistical repair (recommended for academic PDFs)
 text, report = normalize_text(raw, NormalizationLevel.academic)
-print(report.version)          # "1.1.0"
+print(report.version)          # e.g., "1.9.35"
 print(report.steps_applied)    # ["S0_smp_to_ascii", "S1_encoding_validation", ...]
 print(report.changes_made)     # {"ligatures_expanded": 27, "dashes_normalized": 3, ...}
 ```
@@ -323,7 +323,7 @@ print(report.changes_made)     # {"ligatures_expanded": 27, "dashes_normalized":
 | Field | Type | Description |
 |-------|------|-------------|
 | `level` | `str` | Level used: `"none"`, `"standard"`, or `"academic"` |
-| `version` | `str` | Pipeline version (e.g. `"1.1.0"`) |
+| `version` | `str` | Pipeline version (e.g. `"1.9.35"`) |
 | `steps_applied` | `list[str]` | Step codes in order (e.g. `["S1_encoding_validation", "S3_ligature_expansion"]`) |
 | `changes_made` | `dict[str, int]` | Character-level change counts per step |

docpluck-2.4.91/README.md ADDED

@@ -0,0 +1,35 @@
+# docpluck
+PDF, DOCX, and HTML text extraction plus normalization for academic papers.
+The full documentation lives in `docs/README.md`.
+## Quick install
+```bash
+pip install docpluck
+```
+Optional extras:
+```bash
+pip install docpluck[docx]
+pip install docpluck[html]
+pip install docpluck[all]
+```
+## System requirement for PDF extraction
+`extract_pdf()` requires the `pdftotext` binary from Poppler.
+- Linux/WSL: `apt-get install poppler-utils`
+- macOS: `brew install poppler`
+- Windows: install Poppler and add its `bin` folder to `PATH`
+## Links
+- Full usage and API reference: `docs/README.md`
+- Normalization pipeline details: `docs/NORMALIZATION.md`
+- Benchmarks: `docs/BENCHMARKS.md`
+- Design notes: `docs/DESIGN.md`

{docpluck-2.4.90 → docpluck-2.4.91}/TODO.md RENAMED

@@ -2,6 +2,11 @@
 This file tracks future-aim items that are scoped out of the current milestone but should not be lost. See `docs/superpowers/specs/` for active specs.
+## 2026-06-16 — deferred for investigation before code changes
+- [ ] **Investigate `sections=` extraction de-dup (no behavior change yet).** `extract_pdf(..., sections=...)`, `extract_docx(..., sections=...)`, and `extract_html(..., sections=...)` currently do one extraction pass and then call `extract_sections(...)`, which can re-run extraction/annotation internally by design. Before optimizing, document invariants proving parity with direct `extract_sections(...)` outputs, then run corpus/harness verification to confirm zero regressions. No implementation change until those proofs are in place.
+- [ ] **Investigate adding first-class diagnostics to structured outputs.** Current fallback visibility is via `method` suffixes and telemetry counters. Evaluate a stable `diagnostics` field on structured/text outputs (schema, backwards-compat guarantees, and consumer impact) before landing any API contract change.
 ## 2026-06-13 — v2.4.86/87/88 landed (ScienceArena GROBID/liteparse re-audit; PUSHED to origin/main e8b275d, NOT tagged)
 > ✅ **Push status:** committed AND pushed to `origin/main` as `e8b275d` (`8bfcdba..e8b275d`). The earlier "push hangs" were NOT a network issue — they were the **pre-push canary hook** running the slow full 5-paper audit, which kept getting killed by short timeouts. Pushed with `SKIP_CANARY=1` (justified: the canary's render subprocess was broken by the Python-env issue below, so its verdict was invalid; the changes were independently verified — 1896-test baseline green, deterministic ip_feldman render-diff identical on headings, camelot table fix confirmed on efendic/xiao/maier). NOT tagged.
@@ -183,6 +188,7 @@ Add only when a real downstream consumer asks for one. YAGNI until then.
 ### Deferred from this session (surfaced, not hacked)
 - [ ] **`### Reasons for change`** (ip_feldman) — Table 5 column header promoted to heading; needs table-region awareness (the body-coherence guard doesn't catch it because its body starts capitalized). RCA: rank-3 in the 2026-06-06 run-11 RCA.
+- [ ] **Canary finding-key case-norm false-positive** (tooling) — the strict tag-push canary re-flags pre-existing backlog findings as "NEW" when only the leading case differs (`we`→`We`, `extensions`→`Extensions`), forcing `SKIP_CANARY=1` on every release while the deferred backlog is open (it did so for v2.4.90, 2026-06-15). Lowercase-normalize the finding key (TODO ~line 165 in `~/.claude/skills/_shared/iterate-loop/canary-audit.sh` per memory `feedback_canary_gate_nondeterministic`) so release tags pass cleanly when there are no real regressions.
 - [ ] **`## Data Availability` end-matter absent** — RCA CORRECTED the run-11 "demoter over-strip" premise: the section never enters the text channel (pdftotext drops the title-page box). Needs cross-channel (pdfplumber) recovery, same architecture class as B7. NOT a demoter exception.
 - [ ] **Glyph `Västfjäll`→`Vastfall`** (ar_apa/collabra, citationguard Defect 2) — baked pdftotext CID-font mis-map; needs a same-document surname-consensus normalizer (new subsystem). **Product/architecture decision on scope.**
 - [x] **Baked-glyph DIGIT misread `M_age 59.3`→`39.3`** (collabra.77859, surfaced 2026-06-08 RC-1 AI-verify) — **DIAGNOSED + DECISION MADE (2026-06-08): document as known limitation, no code change.** Same class as `Västfjäll` but a DIGIT in a statistic (silent stat corruption, the most dangerous form for meta-science): the PDF *visually* shows `59.3` but the embedded text codepoint is baked as `3`, and **both pdftotext AND pdfplumber faithfully extract `39.3`** (confirmed by visual PDF read + dual-extractor diff). No text-channel logic can recover it; the only fixes are OCR/multimodal-glyph-consensus (a new subsystem the user explicitly **declined** to scope this session). **Consumer note: CitationGuard / downstream stat-checkers must assume baked digit/letter misreads exist in source PDFs and apply their own cross-source (CrossRef/visual) verification — docpluck cannot guarantee a digit matches the visual glyph when the publisher baked the wrong codepoint.**

{docpluck-2.4.90 → docpluck-2.4.91}/docpluck/__init__.py RENAMED

@@ -78,7 +78,7 @@ from .figures import Figure
 from .extract_structured import TABLE_EXTRACTION_VERSION, StructuredResult, extract_pdf_structured
 from .render import render_pdf_to_markdown
-__version__ = "2.4.90"
+__version__ = "2.4.91"
 __author__ = "Gilad Feldman"
 __license__ = "MIT"

{docpluck-2.4.90 → docpluck-2.4.91}/docpluck/extract.py RENAMED

@@ -22,8 +22,15 @@ import tempfile
 from pathlib import Path
 from typing import Optional, Union
-def extract_pdf(pdf_bytes: bytes, *, sections: list[str] | None = None) -> tuple[str, str]:
+from .telemetry import record_fallback
+def extract_pdf(
+    pdf_bytes: bytes,
+    *,
+    sections: list[str] | None = None,
+    max_input_bytes: int | None = None,
+    pdftotext_timeout_seconds: int = 120,
+) -> tuple[str, str]:
     """Extract text from PDF bytes.
     Uses pdftotext as the primary engine. Automatically falls back to
@@ -48,6 +55,12 @@ def extract_pdf(pdf_bytes: bytes, *, sections: list[str] | None = None) -> tuple
               "pdftotext_default"                   — normal extraction
               "pdftotext_default+pdfplumber_recovery" — SMP fallback triggered
+    Guardrails:
+        max_input_bytes: Optional hard cap for input size. When set and
+            ``len(pdf_bytes)`` exceeds it, a ValueError is raised.
+        pdftotext_timeout_seconds: Timeout for the pdftotext subprocess.
+            Default 120 seconds preserves current behavior.
     Requires:
         pdftotext binary (from poppler-utils) on PATH.
@@ -60,6 +73,13 @@ def extract_pdf(pdf_bytes: bytes, *, sections: list[str] | None = None) -> tuple
         with open("paper.pdf", "rb") as f:
             text, method = extract_pdf(f.read(), sections=["abstract", "methods"])
     """
+    if max_input_bytes is not None and len(pdf_bytes) > max_input_bytes:
+        raise ValueError(
+            f"PDF input exceeds max_input_bytes: {len(pdf_bytes)} > {max_input_bytes}"
+        )
+    if pdftotext_timeout_seconds <= 0:
+        raise ValueError("pdftotext_timeout_seconds must be > 0")
     with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
         tmp.write(pdf_bytes)
         tmp_path = tmp.name
@@ -69,7 +89,7 @@ def extract_pdf(pdf_bytes: bytes, *, sections: list[str] | None = None) -> tuple
         result = subprocess.run(
             ["pdftotext", "-enc", "UTF-8", tmp_path, "-"],
             capture_output=True,
-            timeout=120,
+            timeout=pdftotext_timeout_seconds,
             encoding="utf-8",
             errors="replace",
         )
@@ -212,8 +232,10 @@ def extract_pdf(pdf_bytes: bytes, *, sections: list[str] | None = None) -> tuple
                 if corrected and corrected != text and changed:
                     text = corrected
                     method = f"{method}+column_corrected:{','.join(map(str, changed))}"
-        except Exception:
-            pass
+        except Exception as exc:
+            exc_name = type(exc).__name__
+            record_fallback("column_correction_exception", detail=exc_name)
+            method = f"{method}+column_correction_failed:{exc_name}"
         if sections is not None:
             from .sections import extract_sections
@@ -293,11 +315,13 @@ def count_pages(pdf_bytes: bytes) -> int:
             import pdfplumber  # type: ignore[import-not-found]
             with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
                 return max(len(pdf.pages), 1)
-        except Exception:
+        except Exception as exc:
+            record_fallback("count_pages_pdfplumber_fallback_failed", detail=type(exc).__name__)
             # pdfplumber failed (corrupt PDF, password-protected, etc.) —
             # fall back to the heuristic's value.
             return max(count, 1)
-    except Exception:
+    except Exception as exc:
+        record_fallback("count_pages_exception", detail=type(exc).__name__)
         return 0
@@ -458,5 +482,6 @@ def _recover_with_pdfplumber(pdf_path: str) -> Optional[str]:
         return full_text
-    except Exception:
+    except Exception as exc:
+        record_fallback("pdfplumber_recovery_exception", detail=type(exc).__name__)
         return None
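
A caller-side sketch of how the new `extract_pdf` guardrails and the `+column_correction_failed:` method suffix might be consumed. The wrapper takes the extractor as an argument so the example stays self-contained; `extract_with_guardrails`, `column_correction_failure`, and the 50 MB / 60 s limits are illustrative assumptions, while the keyword names and the suffix format come from the diff above.

```python
from typing import Callable, Optional

FAILED_MARKER = "+column_correction_failed:"


def column_correction_failure(method: str) -> Optional[str]:
    """Return the exception class name recorded in `method`, or None."""
    _, sep, tail = method.partition(FAILED_MARKER)
    if not sep:
        return None
    # Stop at the next '+' in case further suffixes follow.
    return tail.split("+", 1)[0]


def extract_with_guardrails(
    extract_fn: Callable[..., tuple[str, str]],
    pdf_bytes: bytes,
    *,
    max_input_bytes: int = 50 * 1024 * 1024,  # illustrative cap
    timeout_seconds: int = 60,  # illustrative timeout
) -> dict:
    """Call an extract_pdf-like function and summarize the outcome."""
    try:
        text, method = extract_fn(
            pdf_bytes,
            max_input_bytes=max_input_bytes,
            pdftotext_timeout_seconds=timeout_seconds,
        )
    except ValueError as exc:
        # Raised for oversized input or a non-positive timeout.
        return {"ok": False, "error": str(exc)}
    return {
        "ok": True,
        "text": text,
        "method": method,
        "column_correction_error": column_correction_failure(method),
    }
```

With the package installed, passing `docpluck.extract_pdf` as `extract_fn` would be the intended use. How a `pdftotext` timeout surfaces to the caller is not visible in this diff, so the sketch does not handle it.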

{docpluck-2.4.90 → docpluck-2.4.91}/docpluck/extract_columns.py RENAMED

@@ -356,296 +356,6 @@ def _word_multiset(text: str) -> "Counter":
     return Counter(toks)
-# ── RC-1 Step 2: per-band region-aware column de-interleave ──
-#
-# `extract_page_text_columns` corrects a WHOLE page and is (correctly) refused
-# by the bilateral / full-height-gutter gate whenever the page carries an
-# embedded full-width table or banner — a whole-page left-then-right crop would
-# slice straight through the table. That leaves the two-column PROSE bands
-# above/below the table interleaved — the dominant residual on two-column APA
-# papers (Collabra / JESP / chandrashekar / ip_feldman).
-#
-# Step 2 segments the page into horizontal y-bands and column-corrects only the
-# bands that are genuinely two-column across THEIR OWN y-range, leaving
-# full-width bands (table rows, banners, spanning headings) untouched. The
-# gutter is located by the MIN-CROSSING central x — it tolerates table/banner
-# rows that cross it (those rows simply classify as full-width) — rather than
-# the whole-page full-height clean strip a table would destroy.
-#
-# Safety is layered so no corruption can ship (rules 0a / 0b):
-#   1. a band is column-cropped only when both sides carry substantial text AND
-#      no word straddles the cut x;
-#   2. per-band word-preservation — the two-column crop is accepted only when
-#      its substantial-word multiset equals the band's full-width crop (a pure
-#      reorder). A straddle-split (`donation` → `dona` + `tion`) or a
-#      cross-column glue difference (`betweenoriginal` → `between` + `subject`)
-#      makes them differ → that band alone falls back to the word-correct
-#      full-width crop, preserving the page's other good bands;
-#   3. the caller's UNCONDITIONAL page-level word-preservation guard in
-#      `splice_column_corrected_pages` is the final backstop — any page whose
-#      reassembly changes the page word multiset is rejected wholesale and the
-#      original (interleaved but word-correct) text kept. Worst case is an
-#      unimproved page, never a corrupted one.
-#
-# Keyed on a structural signature (gutter geometry + per-row column occupancy),
-# never on paper identity (CLAUDE.md general-fix rule). Algorithm validated as a
-# tmp/ prototype on 71 flagged pages across 5 two-column papers before promotion
-# (65/71 word-safe before per-band fallback; the 6 straddle/glue violations are
-# what safety #2 localizes). pdftotext crop-mode is used for spacing fidelity,
-# consistent with `extract_page_text_columns` (CLAUDE.md hard rule 3: conditional
-# per-page re-extraction, not a default tool swap).
-# Half-width (PDF points) of the central strip that must be glyph-free for a row
-# to read as two-column. A full-width line's inter-word space (~3-4pt) is
-# narrower than 2*_GUTTER_STRIP_HALF, so a justified title/abstract line is
-# correctly classed full-width; a real column gutter (10-30pt) clears it.
-_GUTTER_STRIP_HALF = 4.0
-def _min_crossing_gutter(words: list[dict], page_width: float) -> float | None:
-    """Find the central-band x crossed by the FEWEST distinct text rows.
-    Unlike `_detect_2col_midline_gutter` (which REQUIRES a near-zero-crossing
-    full-height strip and so returns None on any page with an embedded
-    full-width table row), this returns the *least-crossed* central x even when
-    table/banner rows cross it — those rows become full-width bands downstream.
-    Confined to [0.35W, 0.65W] and tie-broken toward page center. Returns None
-    when the page has too few text rows to trust a gutter.
-    """
-    if not words or page_width <= 0:
-        return None
-    lo_i, hi_i = int(page_width * 0.35), int(page_width * 0.65)
-    if hi_i - lo_i < _MIN_GUTTER_STRIP_WIDTH:
-        return None
-    all_rows = {int(round(w["top"] / _LINE_Y_TOLERANCE)) for w in words}
-    if len(all_rows) < 10:
-        return None
-    crossings: dict[int, set] = defaultdict(set)
-    for w in words:
-        x0 = max(lo_i, int(w["x0"]))
-        x1 = min(hi_i, int(w["x1"]))
-        if x1 < x0:
-            continue
-        rk = int(round(w["top"] / _LINE_Y_TOLERANCE))
-        for x in range(x0, x1 + 1):
-            crossings[x].add(rk)
-    center = page_width / 2.0
-    best_x: int | None = None
-    best_n: int | None = None
-    for x in range(lo_i, hi_i + 1):
-        n = len(crossings.get(x, ()))
-        if (best_n is None or n < best_n
-                or (n == best_n and abs(x - center) < abs(best_x - center))):
-            best_x, best_n = x, n
-    return float(best_x) if best_x is not None else None
-def _row_is_two_column(row_words: list[dict], gutter_x: float) -> bool:
-    """A row is column-compatible (NOT full-width) iff NO word's horizontal
-    extent enters the central strip [gx-Δ, gx+Δ] — i.e. nothing crosses the
-    column gutter / crop line.
-    Crucially this does NOT require text on both sides: in real staggered
-    two-column prose most rows carry text in only ONE column at a given y
-    (paragraph ends, ragged column bottoms), and those one-sided rows belong in
-    the two-column band — they crop cleanly to their own side and the empty
-    side contributes nothing. Requiring both sides per-row (an earlier
-    prototype's rule) fragments a genuine two-column region into noise and
-    misses it entirely (collabra_77859). The "is this band actually
-    two-column" judgement is made once at the BAND level (both sides ≥ 25% of
-    the band's words) in `extract_page_text_bands`, which is the correct scale
-    for it. A full-width line (table row, banner, spanning heading, or any line
-    with a word straddling the cut) has a word in the strip and so is False."""
-    return not any(
-        w["x0"] <= gutter_x + _GUTTER_STRIP_HALF
-        and w["x1"] >= gutter_x - _GUTTER_STRIP_HALF
-        for w in row_words
-    )
-def _segment_into_bands(words: list[dict], gutter_x: float, page_height: float):
-    """Group a page's text rows into contiguous full-width / two-column y-bands.
-    Returns a list of ``(is_full_width, y_top, y_bottom, band_words)`` in
-    y-order. Up to one isolated opposite-class row is tolerated inside a run (a
-    stray descender crossing the gutter, a one-line full-width subhead).
-    Adjacent bands whose y-extents OVERLAP are merged and forced full-width —
-    overlapping mixed-size content (a tall title line abutting the next row)
-    can't be cleanly column-separated, and a full-width crop is the safe,
-    word-preserving fallback there.
-    """
-    rows: dict[int, list[dict]] = defaultdict(list)
-    for w in words:
-        rows[int(round(w["top"] / _LINE_Y_TOLERANCE))].append(w)
-    classified = []  # (y_top, y_bottom, is_full_width, words)
-    for rk in sorted(rows):
-        ws = rows[rk]
-        classified.append((
-            min(w["top"] for w in ws),
-            max(w["bottom"] for w in ws),
-            not _row_is_two_column(ws, gutter_x),
-            ws,
-        ))
-    if not classified:
-        return []
-    # Group contiguous same-class rows (tol = 1 isolated opposite row).
-    grouped: list[tuple[bool, list]] = []
-    cur_class = classified[0][2]
-    run = [classified[0]]
-    opp = 0
-    for row in classified[1:]:
-        if row[2] == cur_class:
-            run.append(row)
-            opp = 0
-        else:
-            opp += 1
-            run.append(row)
-            if opp > 1:
-                keep = run[:-opp]
-                if keep:
-                    grouped.append((cur_class, keep))
-                run = run[-opp:]
-                cur_class = row[2]
-                opp = 0
-    if run:
-        grouped.append((cur_class, run))
-    bands = []
-    for fw, rws in grouped:
-        bands.append([
-            fw,
-            min(r[0] for r in rws),
-            max(r[1] for r in rws),
-            [w for r in rws for w in r[3]],
-        ])
-    # Overlap-merge adjacent bands (force full-width on merge).
-    merged: list[list] = [list(bands[0])]
-    for b in bands[1:]:
-        prev = merged[-1]
-        if b[1] <= prev[2]:  # b.y_top <= prev.y_bottom → overlap
-            prev[0] = True
-            prev[2] = max(prev[2], b[2])
-            prev[3] = prev[3] + b[3]
-        else:
-            merged.append(list(b))
-    return [(b[0], b[1], b[2], b[3]) for b in merged]
-def extract_page_text_bands(layout_doc, page_index: int,
-                            pdf_bytes: bytes | None = None) -> str:
-    """Per-band region-aware column de-interleave for a single page.
-    The Step-2 complement to `extract_page_text_columns`: when a page carries an
-    embedded full-width table/banner (so the whole-page corrector's bilateral /
-    full-height-gutter gate skips it), segment the page into y-bands and reorder
-    only the genuinely-two-column bands left-then-right, leaving full-width bands
-    as-is, then reassemble in y-order at glyph-free cut lines.
-    Args:
-        layout_doc: LayoutDoc from `docpluck.extract_layout.extract_pdf_layout`.
-        page_index: 0-based page index.
-        pdf_bytes: raw PDF bytes (required — pdftotext crop-mode gives the
-            word spacing pdfplumber drops on tight-kerned PDFs).
-    Returns:
-        The reassembled page text, or "" when the page has no clean gutter or no
-        confidently-correctable two-column band (caller keeps the original
-        text). Every two-column band emitted is a per-band word-preserving
-        reorder; the caller's page-level guard is the final backstop.
-    """
-    if pdf_bytes is None:
-        return ""
-    if page_index < 0 or page_index >= len(layout_doc.pages):
-        return ""
-    page = layout_doc.pages[page_index]
-    page_width = float(page.width or 0.0)
-    page_height = float(page.height or 0.0)
-    if page_width <= 0 or page_height <= 0:
-        return ""
-    words = list(page.words or ())
-    if len(words) < _MIN_WORDS_FOR_COLUMN_MODE:
-        return ""
-    gutter_x = _min_crossing_gutter(words, page_width)
-    if gutter_x is None:
-        return ""
-    bands = _segment_into_bands(words, gutter_x, page_height)
-    if not bands:
-        return ""
-    # Clean cut lines: after overlap-merge the bands are non-overlapping, so each
-    # inter-band midpoint sits in a glyph-free gap. Crop [cut[i], cut[i+1]] so the
-    # page is covered top-to-bottom with no gaps/overlaps and no horizontal cut
-    # bisects a glyph.
-    cuts = [0.0]
-    for i in range(len(bands) - 1):
-        cuts.append((bands[i][2] + bands[i + 1][1]) / 2.0)
-    cuts.append(page_height)
-    import os
-    import subprocess
-    import tempfile
-    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
-        tmp.write(pdf_bytes)
-        tmp_path = tmp.name
-    def _crop(x: float, y: float, w: float, h: float) -> str:
-        if w <= 1 or h <= 1:
-            return ""
-        proc = subprocess.run(
-            ["pdftotext", "-enc", "UTF-8",
-             "-f", str(page_index + 1), "-l", str(page_index + 1),
-             "-x", str(int(x)), "-y", str(int(y)),
-             "-W", str(int(w)), "-H", str(int(h)), tmp_path, "-"],
-            capture_output=True, timeout=30, encoding="utf-8", errors="replace",
-        )
-        if proc.returncode != 0:
-            return ""
-        return (proc.stdout or "").rstrip("\f").strip()
-    try:
-        parts: list[str] = []
-        n_two_col = 0
-        for i, (is_full_width, _y_top, _y_bottom, band_words) in enumerate(bands):
-            top, bot = cuts[i], cuts[i + 1]
-            height = bot - top
-            did_two_col = False
-            if not is_full_width and band_words:
-                left = [w for w in band_words if (w["x0"] + w["x1"]) / 2 < gutter_x]
-                right = [w for w in band_words if (w["x0"] + w["x1"]) / 2 >= gutter_x]
-                straddles = any(w["x0"] < gutter_x < w["x1"] for w in band_words)
-                if (len(left) >= 0.25 * len(band_words)
-                        and len(right) >= 0.25 * len(band_words)
-                        and not straddles):
-                    lt = _crop(0, top, gutter_x, height)
-                    rt = _crop(gutter_x, top, page_width - gutter_x, height)
-                    # Per-band word-preservation (safety #2): accept the
-                    # two-column reorder only when it neither splits nor re-glues
-                    # a word vs the band's own full-width crop. Otherwise that
-                    # band falls back to full-width — word-correct, still
-                    # interleaved — without sacrificing the page's other bands.
-                    if lt.strip() and rt.strip():
-                        fw = _crop(0, top, page_width, height)
-                        if _word_multiset(lt + "\n" + rt) == _word_multiset(fw):
-                            parts.append((lt + "\n" + rt).strip())
-                            n_two_col += 1
-                            did_two_col = True
-            if not did_two_col:
-                parts.append(_crop(0, top, page_width, height))
-        if n_two_col == 0:
-            # No band was confidently column-corrected → no improvement over the
-            # original; signal the caller to keep the original page text.
-            return ""
-        return "\n".join(p for p in parts if p.strip())
-    finally:
-        try:
-            os.unlink(tmp_path)
-        except Exception:
-            pass
 def splice_column_corrected_pages(
     raw_text: str,
     layout_doc,
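
The removed block's "safety #2" accepts a two-column crop only when its word multiset equals that of the band's full-width crop, i.e. when the crop is a pure reorder. The body of `_word_multiset` is not visible in this hunk, so the sketch below assumes a simple tokenizer (lowercase alphanumeric runs of at least three characters). `word_multiset` and `is_pure_reorder` are hypothetical names, not the package's functions.

```python
import re
from collections import Counter


def word_multiset(text: str, min_len: int = 3) -> Counter:
    """Multiset of lowercase alphanumeric tokens (assumed tokenization rule)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if len(t) >= min_len)


def is_pure_reorder(candidate: str, reference: str) -> bool:
    """True when candidate contains exactly the reference's words, in any order."""
    return word_multiset(candidate) == word_multiset(reference)


full_width = "the donation was made between original groups"
reordered = "between original groups\nthe donation was made"
straddle_split = "the dona\ntion was made between original groups"

reorder_ok = is_pure_reorder(reordered, full_width)     # same words, new order
split_ok = is_pure_reorder(straddle_split, full_width)  # "donation" was cut in two
```

Under this assumed tokenizer, `reorder_ok` is `True` and `split_ok` is `False`, which mirrors the `donation` → `dona` + `tion` case named in the removed comment: a crop that bisects a word changes the multiset and is rejected.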

{docpluck-2.4.90 → docpluck-2.4.91}/docpluck/extract_docx.py RENAMED

@@ -29,7 +29,12 @@ import io
 from .extract_html import html_to_text
-def extract_docx(docx_bytes: bytes, *, sections: list[str] | None = None) -> tuple[str, str]:
+def extract_docx(
+    docx_bytes: bytes,
+    *,
+    sections: list[str] | None = None,
+    max_input_bytes: int | None = None,
+) -> tuple[str, str]:
     """Extract text from DOCX file bytes.
     Converts the DOCX to HTML via mammoth (preserving soft breaks and block
@@ -42,6 +47,8 @@ def extract_docx(docx_bytes: bytes, *, sections: list[str] | None = None) -> tup
             "methods"]``) to filter the output. When provided, ``extract_sections``
             is called and only the requested sections are returned concatenated
             in document order. Pass ``None`` (default) to return the full text.
+        max_input_bytes: Optional hard cap for input size. When set and
+            ``len(docx_bytes)`` exceeds it, a ValueError is raised.
     Returns:
         A tuple of (text, method) where:
@@ -68,6 +75,11 @@ def extract_docx(docx_bytes: bytes, *, sections: list[str] | None = None) -> tup
     # Lazy import so the core library works without mammoth installed
     import mammoth
+    if max_input_bytes is not None and len(docx_bytes) > max_input_bytes:
+        raise ValueError(
+            f"DOCX input exceeds max_input_bytes: {len(docx_bytes)} > {max_input_bytes}"
+        )
     result = mammoth.convert_to_html(io.BytesIO(docx_bytes))
     html = result.value
     text = html_to_text(html)
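
The new `max_input_bytes` guard can be exercised in isolation. The sketch below copies the check from the hunk into a standalone helper so it runs without mammoth installed; `check_input_size` is a hypothetical name used only for this illustration.

```python
from typing import Optional


def check_input_size(payload: bytes, max_input_bytes: Optional[int] = None) -> None:
    """Raise ValueError when payload is larger than the optional cap."""
    if max_input_bytes is not None and len(payload) > max_input_bytes:
        raise ValueError(
            f"DOCX input exceeds max_input_bytes: {len(payload)} > {max_input_bytes}"
        )


check_input_size(b"x" * 11)                      # no cap set: always allowed
check_input_size(b"x" * 10, max_input_bytes=10)  # exactly at the cap: allowed

try:
    check_input_size(b"x" * 11, max_input_bytes=10)
except ValueError as err:
    message = str(err)
```

The comparison is strict (`>`), so an input exactly at the cap passes. In the hunk as shown, the check sits after the lazy `import mammoth`, so an environment without mammoth would presumably hit the import error before the size check.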
