PyPI - docpluck - Versions diffs - 2.4.43__tar.gz → 2.4.45__tar.gz - Mend

docpluck 2.4.43.tar.gz → 2.4.45.tar.gz


Files changed (335)

{docpluck-2.4.43 → docpluck-2.4.45}/.claude/skills/_project/lessons.md RENAMED

@@ -182,3 +182,19 @@ Plus three golden snapshot files (`tests/golden/sections/*.json`) had the versio
 **Why:** When the section partitioner fails to recognise a heading, it leaves the heading text on its own line but does NOT blank-separate it from the surrounding section body. So a demoted heading looks like: `<last line of prev section>` / `2. Omission neglect` / `<first line of this section>`.
 **How to detect (next time):** Never gate heading-detection/promotion on blank-line isolation — demoted headings are wedged in prose. Use a different discriminator: for a numbered heading, "not adjacent to a sibling `N.` line" distinguishes a heading-before-section-body from a list-item-in-a-list. For single-level numbered promotion specifically, layer multiple independent gates (document-numbering-range, number-uniqueness, list-adjacency, terminal-punctuation, lowercase-run) so an enumerated list is rejected by several of them at once — defense in depth keeps a wide-false-positive-surface fix safe.
+## 2026-05-16 · Cycle 12 — don't add a normalize step that duplicates an existing one; localize the defect channel first (v2.4.44)
+**What:** Latin typographic ligatures (`ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ`, U+FB00-FB06) rendered verbatim — `conﬁdent`, `inﬂuence` — in 35 corpus `.md` files. The first cycle-12 attempt added a new `decompose_ligatures` helper and called it EARLY in `normalize_text`, not noticing `normalize_text` already had an `S3_ligature_expansion` step (FB00-FB04). The early call consumed every ligature before S3 ran, so S3 tracked `ligatures_expanded = 0` and `test_report_tracks_changes` broke. Worse, the body channel was never the problem: a channel check showed all 35 papers' ligatures sat in table cells / figure-table captions / `unstructured-table` fences — channels that bypass `normalize_text` entirely. The rework removed the duplicate call, unified S3 to call the shared helper (full FB00-FB06 block via an explicit ASCII table — NFKC of `ﬅ` yields a non-ASCII long-s), and kept the genuinely-new `cell_cleaning` + render-post-process calls.
+**Why:** Two failure modes compounded. (1) A new normalize helper added without grepping the existing `normalize_text` S-steps duplicated S3 and, placed before it, starved it. (2) The cycle was scoped from a symptom ("35 papers show ligatures") without localizing WHICH channel was at fault — the body channel was already correct.
+**How to detect (next time):** Before adding any glyph/encoding helper to `normalize.py`, grep the existing `S0`-`S9` / `W0*` steps for one already handling that character class — extend/unify it rather than adding a parallel path, and never insert a new step *before* an existing one that consumes the same input. Before scoping a glyph cycle, localize the defect: grep the offending glyph's lines in a recent render and confirm whether they sit in `<td>`/`<th>`/`*Table N*`/```unstructured-table``` (table/caption/fence channels — bypass `normalize_text`) or in body prose (the S-step channel).
+## 2026-05-16 · Cycle 13 — a heuristic guard's value depends on the false-positive surface, which differs per call site (v2.4.45)
+**What:** `render.py`'s two numbered-heading promoters shared a `max_lc_run >= 5` "long lowercase-word run" prose guard. It demoted legitimate descriptive headings — jdm_.2023.16 had 19 multi-level numbered subsection headings rendered as body text, with lowercase-runs up to 12 (`3.3.2.1. The quality of planning on the previous trial moderates the effect of reflection`). The fix removed the guard ENTIRELY from `_promote_numbered_subsection_headings` but KEPT it (raised 5→8) in `_promote_numbered_section_headings`.
+**Why:** A lowercase-word-run count genuinely cannot distinguish a descriptive section heading from prose — both have many lowercase words. What makes a line a heading is the *number shape* + capital-start + no-terminal-punctuation + single short line. For **multi-level** dotted numbering (`N.N[.N…]`) that signature is decisive — a prose line almost never begins with a multi-level dotted number — so the lc-run guard was pure harm. For **single-level** `N.` numbering the signature is weak (a `2.` line collides with an enumerated-list item), so a prose guard there still adds value as defense-in-depth. Same guard, opposite verdicts, because the false-positive surface differs between the two call sites.
+**How to detect (next time):** When a heuristic guard rejects legitimate inputs, do not just retune its threshold — ask whether the guard discriminates at all at that call site. Reproduce at HEAD and measure the metric's spread on real positives (here: heading lowercase-runs ran 0-12, overlapping prose entirely → no threshold works). If a guard can't separate the classes, remove it where the *other* gates already suffice and keep it only where they don't. When a guard is removed, grep its tests — a contract test (`test_render.py::test_promote_rejects_prose_with_long_lowercase_run`) was asserting the removed behavior and had to be updated in the same cycle.
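
The cycle-13 lesson says a lowercase-word-run count cannot separate descriptive headings from prose. A minimal sketch of that measurement follows, assuming the metric is "longest run of consecutive tokens starting with a lowercase letter" as in the removed `render.py` loop. The function name `max_lowercase_run` and the prose sentence are hypothetical, chosen only for illustration; the heading title is the one quoted in the lesson.

```python
def max_lowercase_run(title: str) -> int:
    """Longest run of consecutive tokens that start with a lowercase letter."""
    run = longest = 0
    for token in title.split():
        if token[0].islower():
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


# Heading title quoted in the lesson (number prefix already stripped).
heading = ("The quality of planning on the previous trial "
           "moderates the effect of reflection")
# Hypothetical body sentence, invented for comparison.
prose = ("We then asked whether the quality of planning "
         "on the previous trial mattered")

heading_run = max_lowercase_run(heading)
prose_run = max_lowercase_run(prose)

# Both old (5) and new (8) thresholds flag the real heading as prose.
rejected_at_5 = heading_run >= 5
rejected_at_8 = heading_run >= 8
```

Both strings score 12 here, so no threshold on this metric alone keeps the heading while rejecting the sentence. That is why the other gates (number shape, capital start, no terminal punctuation) have to carry the decision for multi-level numbering.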

{docpluck-2.4.43 → docpluck-2.4.45}/.claude/skills/docpluck-iterate/LEARNINGS.md RENAMED

@@ -503,3 +503,47 @@ Mid-run, ArticleFinder flagged (and the user confirmed as a directive) that docp
 ### SPINE-SKIPs
 - R3 (`/docpluck-cleanup` + `/docpluck-review`) — SKIPPED. Cycle 11 is one new render post-processor, gated by 5 conjunctive safety checks; 26/26 baseline + AI-gold verifier confirm 0 false positives. Same shape as cycles 1-10.
+---
+## Run: 2026-05-16 (autonomous APA-first run, session 3) · Cycle 12 · v2.4.44
+> **Reworked in run 4 (2026-05-16).** The session-3 cycle-12 attempt was broken — it duplicated the pre-existing S3 step and was never committed. The entry below describes the *reworked, shipped* cycle 12.
+### Outcome
+- **Cycle 12 shipped v2.4.44** — Latin typographic ligatures (U+FB00-FB06: ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ) leaked verbatim in the **table-cell, figure/table-caption, and `unstructured-table`-fence channels**. The body channel's `normalize.py` S3 step already expanded ligatures correctly; those three channels bypass `normalize_text`. `normalize.py::decompose_ligatures` is now the single shared helper for the full U+FB00-FB06 block (explicit ASCII table), called from all three channels (S3 body / `cell_cleaning._html_escape` / `render_pdf_to_markdown` post-process). jdm_m2/korbmacher/jdm16 → 0 residual ligatures. 11 tests.
+### Blind spots / process notes
+- **The session-3 cycle-12 attempt duplicated an existing step.** It added a NEW `decompose_ligatures` call EARLY in `normalize_text` — before the pre-existing `S3_ligature_expansion` step (FB00-FB04). The early call consumed every ligature, so S3 tracked `ligatures_expanded = 0` and `test_report_tracks_changes` broke. **Lesson: before adding a glyph-normalization helper, grep the existing `normalize_text` S-steps for one already handling that glyph class — extend/unify it, never add a parallel path. The rework removed the duplicate call and unified S3 to call the shared helper.**
+- **Verify the body channel is actually broken before "fixing" it.** The cycle was triggered by 35 rendered papers showing raw ligatures — but the body channel was fine; the 35 papers' ligatures were in table cells / captions / fences. A 2-minute check (grep the ligature lines in a recent render, look at whether they sit in `<td>`/`<th>`/`*Table N*`/```unstructured-table``` vs body prose) localizes the defect to the right channel before any code is written.
+- **Explicit ASCII table, not scoped NFKC.** NFKC of `ﬅ` (U+FB05) yields `ſt` with a non-ASCII LONG S — so a per-char NFKC pass does not actually guarantee the ASCII output the docstring promises. An explicit 7-entry table (`ﬅ/ﬆ→st`) does, and matches the existing S3 code style.
+- **The 3-channel glyph pattern, 5th application.** Cycles 2/4/6/7/12 all needed the shared-helper-at-3-chokepoints treatment. The cycle-6 PROPOSED AMENDMENT (still pending user review) is now backed by 5 cycles of evidence — it should be promoted into SKILL.md Phase 4.
+### SPINE-SKIPs
+- R3 (`/docpluck-cleanup` + `/docpluck-review`) — SKIPPED. Cycle 12 is one normalize helper (explicit table over a 7-codepoint block) + S3 unified to call it + 2 bypass-channel call sites; 26/26 baseline + AI verifier confirm no regression. Same shape as cycles 2/4/6/7.
+---
+## Run: 2026-05-16 (run 4, fix-and-continue) · Cycles: cycle-12 rework, tests-regen, cycle 13
+This run executed `docs/HANDOFF_2026-05-16_iterate_run_4_fix_and_continue.md`'s three jobs. Cycle-12 rework + tests-regen + cycle 13 below; the article-finder AI-gold integration (JOB 2) is tracked in the run-meta.
+### tests-regen (commit `c831e28`, no version bump)
+- 15 pre-existing pytest failures triaged. 12 `test_extract_pdf_byte_identical` snapshots + 2 `test_sections_golden` goldens = environmental drift (local pdftotext re-wraps lines differently than the build that captured the snapshots; `extract_pdf` is a pure pdftotext passthrough). Regenerated; the 26-paper baseline is the real extraction-quality gate and stays green.
+- **The 15th, `test_request_09`, is NOT snapshot drift** — it is a real COL-class column-interleave defect: the numbered RSOS bibliography renders as `References\n1. 2. 3. ... 16.\n\nThaler RH...` (the number column split from the entry text). Left red and documented as the escalated COL defect class. Lesson: when a handoff lumps failures as "all snapshot drift," still inspect each — a real-defect-detecting test must never be "regenerated" away.
+### Cycle 13 (v2.4.45) — G5b long-descriptive numbered headings demoted
+### Outcome
+- **Cycle 13 shipped v2.4.45** — `render.py`'s numbered-heading promoters carried a `max_lc_run >= 5` prose guard that demoted legitimate long descriptive headings. Removed the guard entirely from `_promote_numbered_subsection_headings`; raised it `5→8` in `_promote_numbered_section_headings`. jdm_.2023.16: 19 multi-level subsection headings recovered.
+### Blind spots / process notes
+- **The TRIAGE estimate ("raise 5→8") was a partial fix.** Reproducing at HEAD showed jdm16 headings with `max_lc` up to 12 — a `5→8` raise would have left 7 of 19 still demoted. The lesson card `reproduce-triage-defect-at-head-before-trusting-cost-estimate` paid off again: always reproduce and measure before trusting a queue item's prescribed fix. The lc-run count genuinely cannot distinguish a 12-lowercase-word descriptive heading from prose — for multi-level dotted numbering the *number shape* is the discriminator, so the guard had to go, not just move.
+- **A guard worth keeping for one promoter, not the other.** Single-level `N.` numbers collide with enumerated lists (real false-positive risk) → keep a prose guard (raised to 8) as defense-in-depth. Multi-level `N.N[.N…]` numbers do not → the guard was pure harm. Same-named guard, opposite verdicts, because the false-positive surface differs.
+- **A contract test encoded the removed guard.** `test_render.py::test_promote_rejects_prose_with_long_lowercase_run` asserted the old behavior; updated it to assert the new contract (long descriptive titles ARE promoted) in the same cycle — per the cycle-2 `a test can encode the bug` lesson.
+### SPINE-SKIPs
+- R3 (`/docpluck-cleanup` + `/docpluck-review`) — SKIPPED. Cycle 13 is a guard removal + one threshold bump in two render post-processors; 26/26 baseline + heading-promotion-only diff confirm no regression. Same shape as cycles 9/11.
+### Process note — Codex cross-model verification has a Windows UTF-8 bug
+The `gold-generation.md` Step-4 Codex audit misreads UTF-8 gold files as mojibake on this Windows machine (`Västfjäll`→`VA<SI>stfjA<SI>ll`, `–`→`ƒ?"`), producing ~10-24 false "discrepancies" per paper. The gold files are confirmed clean UTF-8. Worked around by re-running Codex with an explicit "files are UTF-8; mojibake is your decode error, not a discrepancy" preamble. **This is article-finder's protocol to fix** — `gold-generation.md` Step 4 needs a UTF-8 read instruction for Windows. Flagged for coordination with the article-finder owner.
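
The cycle-12 notes above describe a quick channel check: find the lines of a rendered `.md` that still carry a ligature and see whether they sit in table cells, captions, `unstructured-table` fences, or body prose. A rough sketch of that check follows. The function name `ligature_channels`, the caption pattern, and the sample text are assumptions made for illustration; this is not a docpluck API.

```python
import re

LIGATURES = re.compile("[\ufb00-\ufb06]")
# Assumed caption shape, based on the "*Table N*" form mentioned in the notes.
CAPTION = re.compile(r"^\*(Table|Figure) \d+")
FENCE = "`" * 3  # built this way to keep literal fences out of the example


def ligature_channels(markdown: str) -> dict[str, int]:
    """Count ligature-bearing lines per channel in a rendered markdown file."""
    counts = {"table_cell": 0, "caption": 0, "fence": 0, "body": 0}
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # Only unstructured-table fences count as the fence channel;
            # any fence line while inside one closes it.
            in_fence = (not in_fence) and stripped.startswith(
                FENCE + "unstructured-table")
            continue
        if not LIGATURES.search(line):
            continue
        if in_fence:
            counts["fence"] += 1
        elif "<td" in line or "<th" in line:
            counts["table_cell"] += 1
        elif CAPTION.match(stripped):
            counts["caption"] += 1
        else:
            counts["body"] += 1
    return counts


sample = "\n".join([
    "We were con\ufb01dent in the result.",
    "<tr><td>in\ufb02uence</td></tr>",
    "*Table 2* E\ufb03ciency by condition",
    FENCE + "unstructured-table",
    "coe\ufb03cient 0.12",
    FENCE,
])
channels = ligature_channels(sample)
```

If `body` is zero on a real render, the body step is not the defect and the fix belongs in the bypass channels.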

{docpluck-2.4.43 → docpluck-2.4.45}/CHANGELOG.md RENAMED

@@ -1,5 +1,27 @@
 # Changelog
+## [2.4.45] — 2026-05-16
+**Cycle 13 (autonomous APA-first run) — long descriptive numbered headings demoted to body text (G5b, S1).** `render.py`'s numbered-heading promoters carried a "long lowercase-word run" prose guard (`max_lc_run >= 5`) that rejected legitimate descriptive headings — e.g. `2.4.2.2. Inference of planning strategies and strategy types`, `3.3.2.1. The quality of planning on the previous trial moderates the effect of reflection`. jdm_.2023.16 alone had 19 multi-level numbered subsection headings demoted to body text.
+Fix (v2.4.45) — the lowercase-run guard is **removed from `_promote_numbered_subsection_headings`**: multi-level dotted numbering at line-start is itself a strong section-heading signal (combined with capital-started title + no terminal sentence punctuation + single ≤80-char line), and descriptive subsection titles legitimately run to many lowercase words, so the guard could not distinguish a real heading from prose and only mis-rejected headings. For `_promote_numbered_section_headings` (single-level `N.`, which genuinely collides with enumerated lists) the guard is **kept but raised `5 → 8`** — single-level promotion still has its document-numbering-range / uniqueness / list-adjacency gates as defense in depth.
+jdm_.2023.16: 19 previously-demoted multi-level headings now render as `###`; the v2.4.44→v2.4.45 diff is heading-promotion only (0 text loss, 0 hallucination). 26/26 baseline PASS. New real-PDF + contract tests in `tests/test_numbered_heading_promotion_real_pdf.py` and `tests/test_render.py`.
+~11 APA papers still FAIL Phase-5d verification; the autonomous run continues.
+## [2.4.44] — 2026-05-16
+**Cycle 12 (autonomous APA-first run) — Latin typographic ligatures not decomposed in the table/caption channels (GLYPH, S2).** pdftotext preserves presentation-form ligature glyphs (`ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ`, U+FB00-FB06) verbatim, so words rendered as `conﬁdent` / `inﬂuence` / `eﬃcient` — broken for search, word matching, and any downstream NLP. A corpus scan found the glyphs in 35 rendered papers (korbmacher 82×, jdm_.2023.16 34×, jdm_m.2022.2 8×). The body channel's `normalize.py` S3 step already expanded ligatures correctly; the leak was confined to **table cells, figure/table captions, and `unstructured-table` fenced blocks**, which bypass `normalize_text` entirely.
+Fix (v2.4.44) — `normalize.py::decompose_ligatures` is now the single shared helper for the full U+FB00-FB06 block, mapping each glyph to ASCII via an explicit table (`ﬁ→fi`, `ﬂ→fl`, `ﬃ→ffi`, `ﬄ→ffl`, `ﬀ→ff`, `ﬅ/ﬆ→st`). An explicit table is used rather than a scoped NFKC pass because NFKC of `ﬅ` (U+FB05) yields `ſt` with a non-ASCII LONG S. The body channel's S3 step calls the helper (and so gains `ﬅ/ﬆ` coverage); `cell_cleaning._html_escape` (table cells) and the `render_pdf_to_markdown` post-process (captions, `unstructured-table` fences, raw_text fallbacks) call it too — the established three-channel glyph-fix pattern.
+Verified across 3 papers: jdm_m.2022.2, korbmacher, jdm_.2023.16 — all now render 0 residual ligature glyphs (was 8 / 82 / 34); `conﬁdent`→`confident`. Superscripts and plain text untouched; the S3 body step still tracks `ligatures_expanded`. 26/26 baseline PASS. 11 tests in `tests/test_ligature_decomposition_real_pdf.py`.
+`NORMALIZATION_VERSION` 1.9.7 → 1.9.8.
+~12 APA papers still FAIL Phase-5d verification; the autonomous run continues.
 ## [2.4.43] — 2026-05-16
 **Cycle 11 (autonomous APA-first run) — single-level numbered section headings demoted to body text (G5a, S1).** Cycle 9 (v2.4.41) promoted multi-level numbered subsection headings (`5.1.`, `6.1.1.`); single-level top-level numbered headings — `2. Omission neglect`, `3. Choice deferral`, `1. Hindsight bias` — were still rendered as plain body text when the title is not a canonical section word.

{docpluck-2.4.43 → docpluck-2.4.45}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpluck
-Version: 2.4.43
+Version: 2.4.45
 Summary: PDF, DOCX, and HTML text extraction and normalization for academic papers
 Project-URL: Homepage, https://github.com/giladfeldman/docpluck
 Project-URL: Documentation, https://github.com/giladfeldman/docpluck/tree/main/docs

{docpluck-2.4.43 → docpluck-2.4.45}/docpluck/__init__.py RENAMED

@@ -71,7 +71,7 @@ from .figures import Figure
 from .extract_structured import TABLE_EXTRACTION_VERSION, StructuredResult, extract_pdf_structured
 from .render import render_pdf_to_markdown
-__version__ = "2.4.43"
+__version__ = "2.4.45"
 __author__ = "Gilad Feldman"
 __license__ = "MIT"

{docpluck-2.4.43 → docpluck-2.4.45}/docpluck/normalize.py RENAMED

@@ -23,7 +23,7 @@ class NormalizationLevel(str, Enum):
     academic = "academic"
-NORMALIZATION_VERSION = "1.9.7"
+NORMALIZATION_VERSION = "1.9.8"
 # ── Mathematical Alphanumeric Symbols de-styling (shared, v2.4.34) ──────────
@@ -1485,6 +1485,30 @@ def recover_minus_via_ci_pairing(text: str) -> str:
     return "\n".join(out)
+# v2.4.44 (NORMALIZATION_VERSION 1.9.8): decompose Latin typographic
+# ligatures (ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ, U+FB00-FB06). pdftotext preserves these
+# presentation-form glyphs verbatim, so words render as "conﬁdent" /
+# "inﬂuence" — broken for search, word matching, and any downstream NLP.
+# An explicit ASCII table is used (not a scoped NFKC pass): NFKC of U+FB05
+# yields "ſt" with a non-ASCII LONG S, and meta-science output must stay
+# ASCII. This is the SINGLE shared helper for all THREE text channels — the
+# S3 body step (normalize_text, below), table-cell cleaning, and the render
+# post-process. Table cells and figure/table captions bypass normalize_text
+# entirely, so a body-only fix leaves them showing raw ligature glyphs.
+_LIGATURE_MAP = {
+    "ﬀ": "ff", "ﬁ": "fi", "ﬂ": "fl",
+    "ﬃ": "ffi", "ﬄ": "ffl", "ﬅ": "st", "ﬆ": "st",
+}
+_LIGATURE_RE = re.compile("[ﬀ-ﬆ]")
+def decompose_ligatures(text: str) -> str:
+    """Decompose Latin typographic ligatures (U+FB00-FB06) to ASCII."""
+    if not text:
+        return text
+    return _LIGATURE_RE.sub(lambda m: _LIGATURE_MAP[m.group(0)], text)
 def normalize_text(
     text: str,
     level: NormalizationLevel,
@@ -1658,13 +1682,11 @@ def normalize_text(
             t = t.replace(accent + vowel, combined)
     report._track("S2_accent_recombination", before, t, "accents_recombined")
-    # S3: Ligature expansion
+    # S3: Ligature expansion \u2014 body channel. Calls the shared
+    # decompose_ligatures helper (full U+FB00-FB06 block, incl. \ufb05/\ufb06\u2192st) so the
+    # body, table-cell, and render-post-process channels stay in lockstep.
     before = t
-    t = t.replace("\ufb00", "ff")
-    t = t.replace("\ufb01", "fi")
-    t = t.replace("\ufb02", "fl")
-    t = t.replace("\ufb03", "ffi")
-    t = t.replace("\ufb04", "ffl")
+    t = decompose_ligatures(t)
     report._track("S3_ligature_expansion", before, t, "ligatures_expanded")
     # S4: Quote normalization

{docpluck-2.4.43 → docpluck-2.4.45}/docpluck/quality.py RENAMED

@@ -16,7 +16,7 @@ COMMON_WORDS = {
     "each", "after", "both", "most", "only", "over", "may", "into",
 }
-LIGATURE_CHARS = set("\ufb00\ufb01\ufb02\ufb03\ufb04")
+LIGATURE_CHARS = set("\ufb00\ufb01\ufb02\ufb03\ufb04\ufb05\ufb06")
 def compute_quality_score(text: str) -> dict:

{docpluck-2.4.43 → docpluck-2.4.45}/docpluck/render.py RENAMED

@@ -34,6 +34,7 @@ from .extract_structured import extract_pdf_structured
 from .normalize import (
     NormalizationLevel,
     _rejoin_garbled_ocr_headers,
+    decompose_ligatures,
     destyle_math_alphanumeric,
     recover_corrupted_lt_operator,
     recover_corrupted_minus_signs,
@@ -232,8 +233,12 @@ def _promote_numbered_subsection_headings(text: str) -> str:
     """Promote ``1.2 Foo``-style lines to ``### 1.2 Foo`` h3 headings.
     Conservative: only multi-level numbering (``N.N`` or deeper), title must
-    start with a capital letter, must not end in sentence-terminator
-    punctuation, and must not look like prose (no long lowercase-word runs).
+    start with a capital letter and must not end in sentence-terminator
+    punctuation. Multi-level dotted numbering at line-start is itself a strong
+    section-heading signal — descriptive subsection titles legitimately run to
+    many lowercase words ("3.3.2.1 The quality of planning on the previous
+    trial moderates the effect of reflection"), so a lowercase-run prose guard
+    mis-rejects real headings and is not applied here (cycle 13, G5b).
     Idempotent: re-running the pass is a no-op.
     """
     if not text:
@@ -249,17 +254,6 @@ def _promote_numbered_subsection_headings(text: str) -> str:
         if title.endswith((".", "?", "!", ":", ",", ";")):
             out.append(line)
             continue
-        tokens = title.split()
-        lc_run = max_lc_run = 0
-        for tok in tokens:
-            if tok and tok[0].islower():
-                lc_run += 1
-                max_lc_run = max(max_lc_run, lc_run)
-            else:
-                lc_run = 0
-        if max_lc_run >= 5:
-            out.append(line)
-            continue
         if out and out[-1].startswith(f"### {m.group('num')} "):
             out.append(line)
             continue
@@ -356,7 +350,7 @@ def _promote_numbered_section_headings(text: str) -> str:
                 max_lc = max(max_lc, lc_run)
             else:
                 lc_run = 0
-        if max_lc >= 5:  # prose-like run — not a heading
+        if max_lc >= 8:  # long prose-like run — not a heading (cycle 13, G5b)
             continue
         candidates.setdefault(int(m.group("num")), []).append((i, title))
     if not candidates:
@@ -2151,6 +2145,12 @@ def render_pdf_to_markdown(
     # B-coefficient table cell, the Mposterior mediation estimates — that
     # the descending-bracket rule structurally cannot see.
     md = recover_minus_via_ci_pairing(md)
+    # v2.4.44: final guarantee — decompose Latin typographic ligatures
+    # (ﬁ->fi, ﬂ->fl, …) from the assembled markdown. normalize (body) and
+    # cell_cleaning (table cells) cover their channels; this catches the
+    # remaining surfaces — figure/table captions, unstructured-table fences,
+    # raw_text fallbacks — so no presentation-form ligature reaches the .md.
+    md = decompose_ligatures(md)
     md = _merge_compound_heading_tails(md)
     md = _reformat_jama_key_points_box(md)
     md = _promote_numbered_subsection_headings(md)
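
With the lowercase-run guard gone, the subsection promoter relies on number shape, a capital-started title, and the absence of terminal punctuation. A simplified sketch of those remaining gates follows. The regex and the name `promote_subsections` are assumptions for illustration; the real function has further checks (for example the adjacent-heading test visible in the hunk) that are omitted here.

```python
import re

# Assumed pattern: two or more dot-separated numbers, an optional trailing
# dot, whitespace, then a title that starts with a capital letter.
SUBSECTION = re.compile(r"^(?P<num>\d+(?:\.\d+)+)\.?\s+(?P<title>[A-Z].*)$")
TERMINATORS = (".", "?", "!", ":", ",", ";")


def promote_subsections(text: str) -> str:
    """Prefix multi-level numbered heading lines with '### '."""
    out = []
    for line in text.split("\n"):
        m = SUBSECTION.match(line)
        if m and not m.group("title").rstrip().endswith(TERMINATORS):
            out.append(f"### {line}")
        else:
            out.append(line)
    return "\n".join(out)


doc = "\n".join([
    "2.4.2.2. Inference of planning strategies and strategy types",
    "2. Omission neglect",
    "3.1 We found that the effect held.",
])
promoted = promote_subsections(doc)
```

Only the first line is promoted: the second is single-level numbering, which the other promoter handles with its own gates, and the third ends in sentence punctuation. Because a promoted line no longer starts with a digit, a second pass leaves the text unchanged.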

{docpluck-2.4.43 → docpluck-2.4.45}/docpluck/tables/cell_cleaning.py RENAMED

@@ -37,6 +37,7 @@ import re
 from typing import Sequence
 from docpluck.normalize import (
+    decompose_ligatures,
     destyle_math_alphanumeric,
     recover_corrupted_lt_operator,
     recover_corrupted_minus_signs,
@@ -58,6 +59,10 @@ def _html_escape(s: str | None) -> str:
     # from the Camelot layout channel and bypass normalize_text's S0 step, so
     # math-italic Greek would otherwise leak raw into rendered table HTML.
     s = destyle_math_alphanumeric(s)
+    # Decompose Latin typographic ligatures (ﬁ->fi, ﬂ->fl, …) — table cells
+    # bypass normalize_text, so a cell "conﬁdent" would otherwise leak the
+    # raw presentation-form glyph into the rendered HTML (v2.4.44).
+    s = decompose_ligatures(s)
     # Recover corrupted minus signs. pdfminer (Camelot's text layer) emits
     # "(cid:0)" for a font glyph it cannot map to Unicode; in academic stat
     # tables that unmapped glyph is the U+2212 minus, always printed directly

docpluck-2.4.45/docs/HANDOFF_2026-05-16_ai-gold-instructions.md ADDED

@@ -0,0 +1,173 @@
+# Instruction — docpluck: AI-gold via article-finder (canonical keys, shared protocol)
+**Date:** 2026-05-16
+**From:** article-finder / cross-project AI-gold coordination
+**To:** docpluck-iterate maintainer / next docpluck session
+**Status:** Action required. article-finder's side is fixed and committed; the items
+below are docpluck's to do in its next iteration.
+---
+## TL;DR
+docpluck must change three things:
+1. **Stop using docpluck's private extraction prompt.** Generate all AI gold through
+   the shared protocol `gold-generation.md` only.
+2. **Key every gold under the paper's canonical DOI.** Bare local stems
+   (`chen_2021_jesp`, `efendic_2022_affect`) are now *rejected* by the cache.
+3. **Regenerate docpluck's existing `reading` golds.** They were produced by the old
+   private prompt and diverge from the shared protocol's output (the "981-line vs
+   617-line" divergence). They are not trustworthy shared ground truth as-is.
+**What article-finder already shipped for you** (skill repo, committed):
+- `generate-gold <pdf>` is now a *routed* invocation — it skips the download cascade
+  and runs `gold-generation.md` directly. Previously it was advertised but unrouted.
+- `gold-generation.md`'s `reading` prompt now transcribes every table **in full,
+  cell-by-cell**, as a markdown grid — not just the caption. This closes the gap
+  that originally justified docpluck's private prompt: the shared `reading` view is
+  now rich enough for docpluck's TABLE verifier.
+- `register-view` and `migrate` now **reject a non-canonical key** with an
+  actionable error.
+- `gold-generation.md` now enforces a **100%-accuracy, zero-hallucination policy**
+  and an **independent Codex / GPT-5.5 cross-model verification**: a second-vendor
+  model re-reads the PDF and audits every gold before it is stored. **Ensure the
+  `codex` CLI is installed and authenticated** in docpluck's environment
+  (`codex --version`; `codex login` if needed) — `generate-gold` blocks without it
+  rather than shipping unverified gold.
+---
+## Why this matters
+The same paper was landing in the cache under two keys — docpluck's short stem
+`chen_2021_jesp` and ESCIcheckapp's PDF-stem `Chen_et_al-2021-JESP-...`. A paper
+split across two keys fragments its record: every project that reads it sees only
+its own slice, `ai-gold.py gaps` reports phantom gaps, and nobody can reuse anyone
+else's work. The cause was every project keying papers its own way. The fix is one
+canonical key per paper, enforced.
+Separately, docpluck's `reading` golds were generated by a docpluck-private prompt
+(`references/ai-full-doc-verify.md`, Step 1b), not the shared `gold-generation.md`.
+Two prompts produce two different "ground truths" for the same PDF. Ground truth
+must be single-source.
+---
+## Rule 1 — Ground truth ONLY through article-finder
+docpluck never re-implements PDF extraction and never carries its own extraction
+prompt. To obtain AI gold for a paper:
+- **Consume first.** Before extracting anything, check the cache:
+  ```
+  python ~/.claude/skills/article-finder/ai-gold.py check <key> --view reading
+  python ~/.claude/skills/article-finder/ai-gold.py get   <key> --view reading
+  ```
+  On a hit, use the cached gold. Zero tokens.
+- **Generate on a miss.** Invoke the `article-finder` skill as `generate-gold <pdf>`.
+  It runs `gold-generation.md` (dual stats extraction → reading+citations carrier
+  pass → cross-check → schema-validated registration) and registers the views under
+  the canonical key. You do not write the extraction logic.
+- **Retire `references/ai-full-doc-verify.md` Step 1b** as a gold *producer*. If
+  docpluck needs an extra verification pass for its own pipeline, that is fine — but
+  the *ground truth* it verifies against comes from the cache, not from that prompt.
+The shared `reading` prompt now captures full cell-by-cell tables, so docpluck's
+table verifier has the detail it needs. If you find it still insufficient, raise it
+with article-finder — do **not** fork the prompt. `gold-generation.md` is owned by
+article-finder and is the single source of extraction rigor.
+## Rule 2 — Canonical keys only
+A cache key must be a **DOI-stem** (`10.1016__j.jesp.2021.104154`) or a
+`fixture__<producer>__<name>` key. Nothing else.
+- Pass the paper's **DOI** as the key to `register-view` / `generate-gold`; the CLI
+  canonicalizes it (`10.1016/j.jesp.2021.104154` → `10.1016__j.jesp.2021.104154`).
+  The DOI is in every gold's `article_metadata.doi` and the `reading` gold's
+  `**DOI:**` line.
+- If a paper genuinely has no DOI, use `fixture__docpluck__<pdf-stem>`.
+- `register-view` and `migrate` now HALT on a bare stem. If docpluck-iterate's
+  autonomous run keys by a local stem, it will fail loudly — fix the iterate skill
+  to resolve and pass the DOI.
+`docpluck.yaml` (the producer manifest) is reflexive — it globs `ai_gold/*/reading.md`
+and keys by the parent directory. Once golds live in canonically-named directories,
+the manifest follows automatically. No manifest edit is needed; the fix is in the
+docpluck-iterate skill that *writes* the golds.
+## Rule 3 — Regenerate the stale `reading` golds
+docpluck's existing `reading` golds in the cache were produced by the old private
+prompt. Regenerate them through `generate-gold` so the cache holds one
+protocol-consistent `reading` view per paper. Priority order: papers other projects
+consume (the fragmented three below) first, then the rest of docpluck's set.
+When you regenerate, the new gold registers under the canonical DOI key. A second
+`reading` gold at the same key supersedes the old one (the cache archives the old
+copy automatically).
+---
+## The three fragmented papers — fix these first
+Each currently has docpluck's `reading` under a short stem AND ESCIcheckapp's
+`reading`+`stats` under a PDF-stem. Regenerate each via `generate-gold` and register
+under the canonical DOI key; the duplicate keys then collapse to one record.
+| Paper | docpluck's current key | Canonical DOI key |
+|---|---|---|
+| Chen et al. 2021, JESP — hindsight bias | `chen_2021_jesp` | `10.1016__j.jesp.2021.104154` |
+| Xiao, Zeng & Feldman 2021, CRSP — decoy effect | `xiao_2021_crsp` | `10.1080__23743603.2021.1878340` |
+| Efendic et al. 2022, SPPS — affect heuristic | `efendic_2022_affect` | `10.1177__19485506211056761` |
+After regeneration, the old short-stem directories (`chen_2021_jesp/` etc.) can be
+removed — coordinate the cleanup with article-finder so `index.json` stays consistent
+(`ai-gold.py audit` must report 0 issues).
+---
+## docpluck's next iteration — step by step
+1. Update docpluck-iterate so its gold step is "invoke `article-finder
+   generate-gold <pdf>`", not the private prompt.
+2. Make docpluck-iterate resolve the paper's DOI and pass it as the key (or let
+   `generate-gold` do it — `gold-generation.md` reads `article_metadata.doi`).
+3. Regenerate `reading` for Chen / Xiao / Efendic; verify each lands under the DOI
+   key with `ai-gold.py views <doi>`.
+4. Regenerate the remaining docpluck `reading` golds through the shared protocol.
+5. Run `ai-gold.py audit` — expect 0 issues. Run `ai-gold.py gaps` to confirm no
+   phantom fragmentation remains.
+6. Commit docpluck's skill change in the docpluck repo; do not commit cache data
+   from docpluck (article-finder owns the cache repo's commits).
+## Command cheat-sheet
+```
+# Resolve a paper to its canonical key
+python ~/.claude/skills/article-finder/ai-gold.py resolve "10.1016/j.jesp.2021.104154"
+# Is the view already cached?
+python ~/.claude/skills/article-finder/ai-gold.py check <key> --view reading
+python ~/.claude/skills/article-finder/ai-gold.py get   <key> --view reading
+# What views does a paper have?
+python ~/.claude/skills/article-finder/ai-gold.py views <key>
+# Generate gold for an uncovered PDF — invoke the article-finder skill:
+#   article-finder generate-gold <absolute-pdf-path>
+# Consistency check (run after any cache change)
+python ~/.claude/skills/article-finder/ai-gold.py audit
+```
+## Definition of done (docpluck)
+- [ ] docpluck-iterate generates gold only via `article-finder generate-gold` /
+      `gold-generation.md`; the private prompt is no longer a gold producer.
+- [ ] Every docpluck gold is registered under a canonical DOI key.
+- [ ] Chen / Xiao / Efendic regenerated and co-located under their DOI keys.
+- [ ] docpluck's other `reading` golds regenerated through the shared protocol.
+- [ ] `ai-gold.py audit` clean.

{docpluck-2.4.43 → docpluck-2.4.45}/docs/TRIAGE_2026-05-14_phase_5d_gold_audit.md RENAMED Viewed

@@ -265,6 +265,16 @@ New `render.py::_promote_numbered_section_headings` promotes `N. Title` → `##
 **G5a RESIDUALS (queued):** the ≥5-lowercase-word prose guard rejects long descriptive headings (`4. Knowledge acquisition, decision delay, and choice outcomes`) — same G5b guard issue; list-number collision under-promotes a section heading whose number a body list reuses (chen 1/2/3/5 — conservative, not a false positive).
+### Cycle 12 (v2.4.44) — GLYPH ligature decomposition — SHIPPED
+`normalize.py::decompose_ligatures` is the single shared helper for the U+FB00-FB06 ligature block — an explicit ASCII table (`ﬁ→fi`, `ﬂ→fl`, …, `ﬅ/ﬆ→st`; NFKC is avoided because `ﬅ`→`ſt` carries a non-ASCII long s). **The body channel's S3 step already expanded ligatures** — the real gap was the table-cell, figure/table-caption, and `unstructured-table`-fence channels that bypass `normalize_text`. The helper is now called from all three channels (S3 body / `cell_cleaning._html_escape` / `render_pdf_to_markdown` post-process); the S3 step also gains `ﬅ/ﬆ` reach. Corpus scan found ligatures in 35 rendered papers (korbmacher 82×, jdm16 34×); jdm_m2/korbmacher/jdm16 verified → 0 residual. The `GLYPH ligature` row below is now RESOLVED.
+> **Cycle-12 rework note (run 4, 2026-05-16):** the first cycle-12 attempt added a SECOND, parallel `decompose_ligatures` call *before* the pre-existing S3 step inside `normalize_text` — it consumed every ligature before S3 ran, so S3 tracked `ligatures_expanded = 0` and broke `test_normalization.py::test_report_tracks_changes`. The rework removed the duplicate call and unified S3 to use the shared helper. Lesson: before adding a glyph-normalization helper, grep the existing `normalize_text` S-steps for one already handling that glyph class — extend/unify it, do not add a parallel path.
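The shared-helper approach described above can be sketched as a single explicit translation table for U+FB00-FB06. This is an illustrative sketch only: the name `decompose_ligatures_sketch` and the table layout are assumptions, and the real `normalize.py::decompose_ligatures` is not shown in this diff. Following the note above, it uses an explicit ASCII mapping rather than Unicode normalization.

```python
# Explicit ASCII expansions for the Latin ligature block U+FB00-FB06.
LIGATURE_MAP = {
    "\ufb00": "ff",   # ﬀ
    "\ufb01": "fi",   # ﬁ
    "\ufb02": "fl",   # ﬂ
    "\ufb03": "ffi",  # ﬃ
    "\ufb04": "ffl",  # ﬄ
    "\ufb05": "st",   # ﬅ (long s + t)
    "\ufb06": "st",   # ﬆ
}
_LIGATURE_TABLE = str.maketrans(LIGATURE_MAP)


def decompose_ligatures_sketch(text: str) -> str:
    """Replace each typographic ligature with its ASCII letter sequence."""
    return text.translate(_LIGATURE_TABLE)


def count_ligatures(text: str) -> int:
    """Count ligature code points, e.g. to feed a change-tracking report."""
    return sum(1 for ch in text if ch in LIGATURE_MAP)


if __name__ == "__main__":
    sample = "con\ufb01dent in\ufb02uence"
    print(count_ligatures(sample), decompose_ligatures_sketch(sample))
```

Keeping one helper and calling it from each channel (body, table cells, render post-process) means every channel sees the same table. Counting before translating is one way a step such as S3 could still report how many ligatures it expanded.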
+### Cycle 13 (v2.4.45) — G5b long-descriptive-title prose guard — SHIPPED
+`render.py`'s numbered-heading promoters carried a `max_lc_run >= 5` "long lowercase-word run" prose guard that mis-rejected legitimate descriptive headings. Reproduced at HEAD: jdm_.2023.16 alone had **19** multi-level numbered subsection headings demoted to body text, with `max_lc` up to **12** (`3.3.2.1. The quality of planning on the previous trial moderates the effect of reflection`) — far deeper than the TRIAGE's "raise 5→8" estimate. Re-scoped: the lc-run guard is **removed entirely from `_promote_numbered_subsection_headings`** (multi-level dotted numbering + capital-start + no-terminal-punctuation + single ≤80-char line is itself a sufficient heading signature; the lc-run guard cannot distinguish a descriptive heading from prose). For `_promote_numbered_section_headings` (single-level `N.`, real list-collision risk) the guard is kept but raised `5→8`, alongside its existing numbering-range/uniqueness/list-adjacency gates. jdm16: 19 headings recovered; v2.4.44→v2.4.45 diff is heading-promotion only (0 text loss/hallucination); 26/26 baseline.
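To make the lowercase-run numbers above concrete, here is a small sketch of how such a run could be measured. It assumes a "run" is the longest sequence of consecutive whitespace-separated words that start with a lowercase letter, counted after the numbering prefix is stripped. `max_lc_run` and `rejected_by_guard` are hypothetical reconstructions, not the actual `render.py` code.

```python
import re

# Leading numbering such as "4. " or "3.3.2.1. " (assumed heading shape).
NUMBER_PREFIX = re.compile(r"^\d+(?:\.\d+)*\.\s+")


def max_lc_run(heading: str) -> int:
    """Longest run of consecutive words that begin with a lowercase letter."""
    title = NUMBER_PREFIX.sub("", heading)
    longest = current = 0
    for word in title.split():
        if word[0].islower():
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def rejected_by_guard(heading: str, threshold: int) -> bool:
    """True when the prose guard would refuse to promote the heading."""
    return max_lc_run(heading) >= threshold


if __name__ == "__main__":
    short = "4. Knowledge acquisition, decision delay, and choice outcomes"
    deep = ("3.3.2.1. The quality of planning on the previous trial "
            "moderates the effect of reflection")
    print(max_lc_run(short), max_lc_run(deep))  # prints: 6 12
```

Under this definition the single-level heading has a run of 6, so it is rejected at threshold 5 and accepted at 8. The multi-level heading has a run of 12 and would be rejected at either threshold, which is consistent with dropping the guard from the subsection promoter rather than only raising it.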
 ### SESSION-3 STANDING VERDICT (rule 0e-bis)
 The APA corpus is **NOT clean**. Cycles 8-11 shipped 4 verified incremental fixes (v2.4.40-43), each AI-gold-verified OVERALL PASS with 0 regressions. But ~12 APA papers still FAIL Phase-5d on PRE-EXISTING defects the cycles did not reach. Verifier-confirmed open punch-list:
@@ -274,7 +284,7 @@ The APA corpus is **NOT clean**. Cycles 8-11 shipped 4 verified incremental fixe
 | **TABLE structure destruction** | S0/S1 | efendic, ar_apa_011, xiao, jdm15/16, chen, maier, ip_feldman (~11) | grid lost → caption-bleed; flat number-dump; empty `<table>` shells; two tables merged; rows dropped. C3 — needs a render/structured coordination design. The single largest blocker. |
 | **G5c split-line numbered headings** | S1 | jdm_m.2022.2 (`5.3.`/`6.3.`/`7.3.` etc.) | number alone on a line, title on the next; renders as orphan bare-number + a MISLABELED generic `## Results`. cycle-3 orphan-folder multi-level analogue. |
 | **G5d named (unnumbered) heading demotion** | S1 | ar_apa_011 (`Participants`, `Overview`), efendic, chandrashekar, ip_feldman (~7) | section-partitioner work; largest false-positive surface. |
-| **G5b long-descriptive-title prose guard** | S1 | jdm16, jdm_m2, chen | `≥5-lowercase-word` guard over-rejects legit long numbered headings. |
+| ~~**G5b long-descriptive-title prose guard**~~ ✓ FIXED v2.4.45 (cycle 13) | S1 | jdm16, jdm_m2, chen | ~~`≥5-lowercase-word` guard over-rejects legit long numbered headings.~~ Subsection promoter's lc-run guard removed; single-level raised 5→8. |
 | **FIG caption double-emission + truncation** | S2 | jdm_m2, efendic, chan_feldman, ziano, jdm15/16 (~8) | caption inline + in `## Figures` block; truncated mid-word; figure data-labels as orphan body lines. |
 | **GLYPH ligature** `ﬁ`/`ﬂ` not decomposed | S2 | jdm_m2 (and likely many) | `conﬁdent`, `inﬂuence` — NFKC would fix; check why current NFC pass misses U+FB01/FB02. |
 | **D4 metadata residuals** | S2 | ar_apa_011 (`doi:` line), chen, efendic masthead | see D4 RESIDUALS above. |

{docpluck-2.4.43 → docpluck-2.4.45}/pyproject.toml RENAMED Viewed

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "docpluck"
-version = "2.4.43"
+version = "2.4.45"
 description = "PDF, DOCX, and HTML text extraction and normalization for academic papers"
 readme = "docs/README.md"
 requires-python = ">=3.10"

{docpluck-2.4.43 → docpluck-2.4.45}/tests/golden/sections/apa_multi_study_pdf.json RENAMED Viewed

@@ -6,7 +6,7 @@
       "label": "abstract",
       "canonical_label": "abstract",
       "char_start": 0,
-      "char_end": 28,
+      "char_end": 29,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -15,8 +15,8 @@
     {
       "label": "introduction",
       "canonical_label": "introduction",
-      "char_start": 28,
-      "char_end": 53,
+      "char_start": 29,
+      "char_end": 55,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -25,8 +25,8 @@
     {
       "label": "methods",
       "canonical_label": "methods",
-      "char_start": 53,
-      "char_end": 78,
+      "char_start": 55,
+      "char_end": 81,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -35,8 +35,8 @@
     {
       "label": "results",
       "canonical_label": "results",
-      "char_start": 78,
-      "char_end": 103,
+      "char_start": 81,
+      "char_end": 107,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -45,8 +45,8 @@
     {
       "label": "methods_2",
       "canonical_label": "methods",
-      "char_start": 103,
-      "char_end": 128,
+      "char_start": 107,
+      "char_end": 133,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -55,8 +55,8 @@
     {
       "label": "results_2",
       "canonical_label": "results",
-      "char_start": 128,
-      "char_end": 153,
+      "char_start": 133,
+      "char_end": 159,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -65,8 +65,8 @@
     {
       "label": "general_discussion",
       "canonical_label": "general_discussion",
-      "char_start": 153,
-      "char_end": 183,
+      "char_start": 159,
+      "char_end": 190,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -75,8 +75,8 @@
     {
       "label": "references",
       "canonical_label": "references",
-      "char_start": 183,
-      "char_end": 213,
+      "char_start": 190,
+      "char_end": 220,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",

{docpluck-2.4.43 → docpluck-2.4.45}/tests/golden/sections/apa_single_study_pdf.json RENAMED Viewed

@@ -6,7 +6,7 @@
       "label": "abstract",
       "canonical_label": "abstract",
       "char_start": 0,
-      "char_end": 36,
+      "char_end": 37,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -15,8 +15,8 @@
     {
       "label": "introduction",
       "canonical_label": "introduction",
-      "char_start": 36,
-      "char_end": 61,
+      "char_start": 37,
+      "char_end": 63,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -25,8 +25,8 @@
     {
       "label": "methods",
       "canonical_label": "methods",
-      "char_start": 61,
-      "char_end": 84,
+      "char_start": 63,
+      "char_end": 87,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -35,8 +35,8 @@
     {
       "label": "results",
       "canonical_label": "results",
-      "char_start": 84,
-      "char_end": 108,
+      "char_start": 87,
+      "char_end": 112,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -45,8 +45,8 @@
     {
       "label": "discussion",
       "canonical_label": "discussion",
-      "char_start": 108,
-      "char_end": 133,
+      "char_start": 112,
+      "char_end": 138,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
@@ -55,8 +55,8 @@
     {
       "label": "references",
       "canonical_label": "references",
-      "char_start": 133,
-      "char_end": 163,
+      "char_start": 138,
+      "char_end": 168,
       "pages": [],
       "confidence": "high",
       "detected_via": "heading_match",
