docpluck 2.4.45tar.gz → 2.4.47tar.gz

Files changed (327)

{docpluck-2.4.45 → docpluck-2.4.47}/.claude/skills/_project/lessons.md RENAMED

@@ -198,3 +198,11 @@ Plus three golden snapshot files (`tests/golden/sections/*.json`) had the versio
 **Why:** A lowercase-word-run count genuinely cannot distinguish a descriptive section heading from prose — both have many lowercase words. What makes a line a heading is the *number shape* + capital-start + no-terminal-punctuation + single short line. For **multi-level** dotted numbering (`N.N[.N…]`) that signature is decisive — a prose line almost never begins with a multi-level dotted number — so the lc-run guard was pure harm. For **single-level** `N.` numbering the signature is weak (a `2.` line collides with an enumerated-list item), so a prose guard there still adds value as defense-in-depth. Same guard, opposite verdicts, because the false-positive surface differs between the two call sites.
 **How to detect (next time):** When a heuristic guard rejects legitimate inputs, do not just retune its threshold — ask whether the guard discriminates at all at that call site. Reproduce at HEAD and measure the metric's spread on real positives (here: heading lowercase-runs ran 0-12, overlapping prose entirely → no threshold works). If a guard can't separate the classes, remove it where the *other* gates already suffice and keep it only where they don't. When a guard is removed, grep its tests — a contract test (`test_render.py::test_promote_rejects_prose_with_long_lowercase_run`) was asserting the removed behavior and had to be updated in the same cycle.
+## 2026-05-16 · Cycle G5c-1 — a render-layer fold can only fix the immediately-adjacent case; resist generalizing it into partitioner work (v2.4.46)
+**What:** pdftotext splits a numbered subsection heading `5.4. Discussion` into a bare `5.4.` line + a separate `Discussion` line; the section partitioner promotes the lone word to a generic `## Discussion` and strands the number. New render post-processor `_fold_orphan_multilevel_numerals_into_headings` (the multi-level analogue of the existing arabic/roman orphan folders) folds `5.4.`⏎`## Discussion` → `### 5.4. Discussion`. In the target paper jdm_m.2022.2 there are 6 such orphan numbers but only 1 (`5.4.`) is foldable — the other 5 are followed by a figure block or by body prose (the title word was consumed elsewhere by the partitioner), so there is no heading to fold into. The cycle fixed exactly the 1 foldable case and left the other 5 for G5c-2 (partitioner split-heading rejoin).
+**Why:** The structural signature a render-layer fold can act on is "orphan number + blank-line-only adjacency to a heading." When the partitioner has already consumed the title word elsewhere, the heading does not exist at that position — no render post-processor can recover it; that needs partitioner-level work. Folding `5.3.` into the `### Figure 1` that happens to sit below it (the nearest heading) would emit `### 5.3. Figure 1` — actively wrong. Two guards encode this: the fold target excludes `### Figure N`/`### Table N` (library-emitted structural markers) and already-numbered headings, and only blank-line-only adjacency counts.
+**How to detect (next time):** When a TRIAGE item lists N instances of a defect in one paper, reproduce ALL N at HEAD before scoping — they may not share a fixable shape. Here 1/6 was a render-layer fold and 5/6 were partitioner work; the cycle-14 investigation had already split G5c into G5c-1 (render) + G5c-2 (partitioner). A fix that "only handles 1 of 6" is correct when the other 5 are a genuinely different defect class — do not stretch a render-layer regex to chase cases that have no anchor for it.

{docpluck-2.4.45 → docpluck-2.4.47}/.claude/skills/docpluck-iterate/LEARNINGS.md RENAMED

@@ -547,3 +547,23 @@ This run executed `docs/HANDOFF_2026-05-16_iterate_run_4_fix_and_continue.md`'s
 ### Process note — Codex cross-model verification has a Windows UTF-8 bug
 The `gold-generation.md` Step-4 Codex audit misreads UTF-8 gold files as mojibake on this Windows machine (`Västfjäll`→`VA<SI>stfjA<SI>ll`, `–`→`ƒ?"`), producing ~10-24 false "discrepancies" per paper. The gold files are confirmed clean UTF-8. Worked around by re-running Codex with an explicit "files are UTF-8; mojibake is your decode error, not a discrepancy" preamble. **This is article-finder's protocol to fix** — `gold-generation.md` Step 4 needs a UTF-8 read instruction for Windows. Flagged for coordination with the article-finder owner.
+---
+## Run: 2026-05-16 (run 5) · Cycle G5c-1 · v2.4.46
+### Outcome
+- SHIPPED v2.4.46. New render post-processor `_fold_orphan_multilevel_numerals_into_headings` folds an orphan multi-level `N.N.` number into the immediately-adjacent generic `##`/`###` heading: `5.4.`⏎`## Discussion` → `### 5.4. Discussion`. jdm_m.2022.2's `5.4. Discussion` recovered, AI-gold-verified correct. Three-tier parity byte-identical (Tier1=Tier2=Tier3). 26/26 baseline. 11 new tests.
+### Blind Spots
+- **The fix's scope is exactly 1 of 6 orphan numbers in the target paper — and that is correct.** jdm_m.2022.2 has 6 orphan multi-level numbers (`5.3.`/`5.4.`/`6.3.`/`6.4.`/`7.3.`/`7.4.`). Only `5.4.` is followed by a generic heading; the cycle-14 investigation already established the other 5 are partitioner title-loss (G5c-2). A render-layer fold can ONLY fix the immediately-adjacent case. Resisting the temptation to "make the fix fold all 6" is the discipline — the other 5 have no heading to fold into (the title word was consumed elsewhere by the partitioner). Folding `5.3.` into the `### Figure 1` that happens to sit below it would be actively wrong.
+### Edge Cases
+- **Must exclude `### Figure N` / `### Table N` from fold targets.** These are library-emitted structural markers, not section headings. `5.3.` sits immediately above `### Figure 1` in jdm_m.2022.2 — a naive "fold into next heading" would emit `### 5.3. Figure 1`. The fold-target regex carries `(?!Figure\b)(?!Table\b)` negative lookaheads plus `(?!\s*\d)` (skip already-numbered headings — also gives idempotency).
+- **Heading level is decided by the orphan number's depth, not the adjacent heading's level.** A multi-level dotted number always denotes a subsection, so the fold emits `### ` even when the partitioner had promoted the stranded title to `## ` (demote `## Discussion` → `### 5.4. Discussion`). This matches `_NUMBERED_SUBSECTION_HEADING_RE`, which likewise emits `### ` at any dotted depth.
+### Verification Gaps
+- None new. Phase 5d AI-gold verify worked as designed: it confirmed the cycle's one change correct AND surfaced the pre-existing G5c-2 / HALLUC-HEAD / FIG defects (all already in TRIAGE). The verifier returning FAIL on a paper whose cycle-specific diff is clean is the EXPECTED rule-0e-bis behavior — the cycle ships incrementally; the run's standing verdict stays FAIL while the corpus is not clean.
+### SPINE-SKIPs
+- Phase 7 `/docpluck-cleanup` + `/docpluck-review` sub-skills — SKIPPED (user chose the lean-checks release path for context budget). Essential hard-rule checks done inline: render-layer regex-only change, no `-layout`/AGPL/tool-swap/U+2212/ImportError/HTML-table surface touched; general fix keyed on a structural signature (no paper identity); real-PDF test added; no `.pdf` staged. 26/26 baseline + heading-markup-only three-tier diff confirm no regression.

{docpluck-2.4.45 → docpluck-2.4.47}/.claude/skills/docpluck-iterate/references/ai-full-doc-verify.md RENAMED

@@ -68,6 +68,8 @@ generate-gold <absolute-path-to-PDF>
 `article-finder`'s `generate-gold` runs `gold-generation.md` end to end — the canonical extraction prompt(s), anti-hallucination rules, cross-check, schema validation, and `register-view` storage under the canonical key. It produces (and registers) the `reading`, `citations`, and `stats` views in one pass. This is the ONLY sanctioned way for docpluck-iterate to obtain a gold that does not already exist.
+> **Prerequisite — the `codex` CLI must be installed AND authenticated.** `gold-generation.md` runs an independent Codex / GPT-5.5 cross-model audit of every gold before storage, and `generate-gold` **blocks** (rather than shipping unverified gold) if `codex` is missing or unauthenticated. Verify ahead of time with `codex --version`. If a `generate-gold` call fails on the cross-model audit step, the user must run `codex login` — it is interactive, so the skill cannot do it; surface this to the user as a blocker rather than falling back to a local extraction.
 After `generate-gold` completes, the `reading` view is in the shared cache — copy it to the working path exactly as in the cache-HIT branch:
 ```bash

{docpluck-2.4.45 → docpluck-2.4.47}/CHANGELOG.md RENAMED

@@ -1,5 +1,25 @@
 # Changelog
+## [2.4.47] — 2026-05-16
+**Cycle FIG-1 (APA-first run) — figure caption truncated mid-word with an ellipsis (FIG, S2).** When pdftotext welds a figure's following body prose onto its caption with only a single newline (no `\n\n` paragraph break), the `_extract_caption_text` paragraph-walk cannot find a stopping point and absorbs body prose up to the 800-char hard cap. The old 400-char cap then cut the caption mid-word and appended `…` — e.g. `jdm_m.2022.2` Figure 1 absorbed the `H1 :` hypothesis statement and Figure 3 absorbed a `(N = 61) performed …` body sentence, both ending in a fragment. A corpus scan found 12 such truncated figure captions across 6 APA papers.
+Fix (v2.4.47) — new helper `_trim_overflowing_figure_caption` in `extract_structured.py`. When a figure caption overflows the 400-char hard cap (which, in the 17-paper APA corpus, only ever happens on an over-absorbed caption — no legitimate figure caption exceeds ~360 chars), it walks the cap window back to the last genuine sentence terminator instead of hard-truncating mid-word. Abbreviation periods (`vs.`, `e.g.`, author initials) are skipped so the caption is not cut mid-clause, and the surviving caption is required to keep real description content past its label. Keyed purely on the structural signature (caption overflow + sentence boundary), figures only — table captions keep the existing `_trim_table_caption_at_cell_region` path.
+`jdm_m.2022.2` Figures 1 and 3 are recovered exactly to the AI gold; all 12 ellipsis-truncated captions across the APA corpus are eliminated (0 remain, 0 over-400). Phase-5d AI-gold verify across 28 figures in 6 papers: 0 text-loss, 0 ellipsis-truncated. 6 captions retain partial trailing body prose (a sentence-terminated residual — queued as cycle FIG-2). The v2.4.46→v2.4.47 diff is figure-caption-text only (0 body text loss, 0 hallucination — the absorbed prose remains intact in the body). 26/26 baseline PASS. New real-PDF + contract tests in `tests/test_figure_caption_trim_real_pdf.py`.
+~11 APA papers still FAIL Phase-5d verification (FIG-2 caption residual + double-emission, G5c-2 partitioner split-heading rejoin, HALLUC-HEAD, TABLE cluster, COL); the run continues.
+## [2.4.46] — 2026-05-16
+**Cycle G5c-1 (APA-first run) — orphan multi-level section number stranded above its heading (G5c, S1).** pdftotext sometimes splits a numbered subsection heading such as `5.4. Discussion` into a bare `5.4.` line and a separate `Discussion` line; the section partitioner then promotes the lone title word to a generic `## Discussion` and strands the number on its own line. In `jdm_m.2022.2` the `5.4. Discussion` subsection of Study 1 rendered as an orphan `5.4.` followed by a top-level `## Discussion`.
+Fix (v2.4.46) — new render post-processor `_fold_orphan_multilevel_numerals_into_headings`, the multi-level analogue of `_fold_orphan_arabic_numerals_into_headings` / `_fold_orphan_roman_numerals_into_headings`. It folds an orphan `N.N.` number into the **immediately-adjacent** generic `##`/`###` heading and emits it at subsection level: `5.4.`⏎`## Discussion` → `### 5.4. Discussion`. Keyed purely on the structural signature (an isolated multi-level dotted number is itself a strong subsection marker — body prose and list items never emit a bare `5.4.` line) plus blank-line-only adjacency. `### Figure N` / `### Table N` (library-emitted structural markers) and already-numbered headings are excluded. Only the immediately-adjacent case is folded; an orphan number whose title word the partitioner consumed elsewhere (leaving body prose below the number) is partitioner-level work (G5c-2) and is left untouched.
+`jdm_m.2022.2`: the `5.4. Discussion` heading is recovered and AI-gold-verified correct. The v2.4.45→v2.4.46 diff is heading-markup only (0 text loss, 0 hallucination). 26/26 baseline PASS. New real-PDF + contract tests in `tests/test_orphan_multilevel_number_real_pdf.py`.
+~11 APA papers still FAIL Phase-5d verification (G5c-2 partitioner split-heading rejoin, HALLUC-HEAD, FIG caption double-emission, TABLE cluster, COL); the run continues.
 ## [2.4.45] — 2026-05-16
 **Cycle 13 (autonomous APA-first run) — long descriptive numbered headings demoted to body text (G5b, S1).** `render.py`'s numbered-heading promoters carried a "long lowercase-word run" prose guard (`max_lc_run >= 5`) that rejected legitimate descriptive headings — e.g. `2.4.2.2. Inference of planning strategies and strategy types`, `3.3.2.1. The quality of planning on the previous trial moderates the effect of reflection`. jdm_.2023.16 alone had 19 multi-level numbered subsection headings demoted to body text.

{docpluck-2.4.45 → docpluck-2.4.47}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpluck
-Version: 2.4.45
+Version: 2.4.47
 Summary: PDF, DOCX, and HTML text extraction and normalization for academic papers
 Project-URL: Homepage, https://github.com/giladfeldman/docpluck
 Project-URL: Documentation, https://github.com/giladfeldman/docpluck/tree/main/docs

{docpluck-2.4.45 → docpluck-2.4.47}/docpluck/__init__.py RENAMED

@@ -71,7 +71,7 @@ from .figures import Figure
 from .extract_structured import TABLE_EXTRACTION_VERSION, StructuredResult, extract_pdf_structured
 from .render import render_pdf_to_markdown
-__version__ = "2.4.45"
+__version__ = "2.4.47"
 __author__ = "Gilad Feldman"
 __license__ = "MIT"

{docpluck-2.4.45 → docpluck-2.4.47}/docpluck/extract_structured.py RENAMED

@@ -398,7 +398,15 @@ def _extract_caption_text(
         snippet = _trim_caption_at_running_header_tail(snippet)
         snippet = _trim_caption_at_body_prose_boundary(snippet)
     if len(snippet) > 400:
-        snippet = snippet[:400].rsplit(" ", 1)[0] + "…"
+        # Figure captions: an overflow is over-absorbed body prose — walk
+        # back to the last real sentence terminator so the caption ends
+        # cleanly instead of being cut mid-word with an ellipsis. Tables
+        # keep the mid-word cap (their overflow is linearized cell text
+        # already bounded by ``_trim_table_caption_at_cell_region``).
+        if cap.kind == "figure":
+            snippet = _trim_overflowing_figure_caption(snippet)
+        else:
+            snippet = snippet[:400].rsplit(" ", 1)[0] + "…"
     return snippet
@@ -837,6 +845,65 @@ def _looks_like_body_prose(tail: str) -> bool:
     return _BODY_PROSE_SIGNAL_RE.search(tail) is not None
+# Abbreviations whose trailing period is NOT a sentence terminator. Used
+# by the figure-caption overflow walk-back so it doesn't cut a caption
+# mid-clause at "vs.", "e.g.", an author initial, etc.
+_CAPTION_NON_TERMINAL_ABBREV = frozenset({
+    "vs", "e.g", "i.e", "cf", "fig", "figs", "no", "nos", "eq", "eqs",
+    "al", "etc", "dr", "mr", "mrs", "ms", "prof", "ca", "approx",
+    "ref", "refs", "ed", "eds", "vol", "pp", "sd", "se", "ns",
+})
+# A sentence-terminator followed by whitespace or end-of-string:
+# ``.``/``!``/``?`` with an optional closing bracket/quote. ``m.start()``
+# lands on the terminator char itself.
+_CAPTION_SENTENCE_END_RE = re.compile(r"[.!?][\"'\)\]]?(?=\s|$)")
+# The label prefix of a caption (``Figure 12.`` / ``Fig. 3`` / ``Table 1.``).
+# Used to ensure the overflow walk-back keeps real description content and
+# doesn't collapse the caption to just its label.
+_CAPTION_LABEL_PREFIX_RE = re.compile(
+    r"(?:figure|fig\.?|table)\s+\d+(?:\.\d+)?\.?\s*", re.IGNORECASE
+)
+def _trim_overflowing_figure_caption(snippet: str, limit: int = 400) -> str:
+    """Trim a figure caption that overflows the hard cap back to its last
+    real sentence terminator instead of hard-truncating mid-word.
+    A figure caption longer than ``limit`` chars (after every targeted
+    trim above has already run) is, in the academic-PDF corpus, always a
+    caption that absorbed following body prose — no legitimate figure
+    caption in the 17-paper APA test corpus exceeds ~360 chars. The old
+    behavior (``snippet[:400].rsplit(" ", 1)[0] + "…"``) cut the caption
+    mid-word and appended an ellipsis, leaving the user a caption ending
+    in a fragment. This walks the cap window back to the last genuine
+    sentence terminator (skipping abbreviations like ``vs.`` and author
+    initials) so the caption ends cleanly on a real sentence boundary.
+    Keyed purely on the structural signature (caption overflow + sentence
+    boundary), not on paper identity. Falls back to the mid-word cap only
+    when no usable terminator exists past the label.
+    """
+    head = snippet[:limit]
+    label = _CAPTION_LABEL_PREFIX_RE.match(head)
+    label_end = label.end() if label else 0
+    best = -1
+    for m in _CAPTION_SENTENCE_END_RE.finditer(head):
+        word = re.search(r"([A-Za-z.]+)$", head[: m.start() + 1])
+        tok = word.group(1).rstrip(".").lower() if word else ""
+        if tok in _CAPTION_NON_TERMINAL_ABBREV:
+            continue
+        if len(tok) == 1 and tok.isalpha():  # author initial, e.g. "J."
+            continue
+        best = m.start() + 1  # keep through the terminator char
+    # The terminator must sit past the label, so the surviving caption
+    # retains real description content (not just ``Figure N.``).
+    if best > label_end:
+        return snippet[:best].rstrip()
+    return snippet[:limit].rsplit(" ", 1)[0] + "…"
 def _isolated_table_from_caption(
     cap: CaptionMatch,
     raw_text: str,
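
The walk-back added in this hunk can be exercised in isolation. The sketch below restates the helper's logic outside the package so it runs on its own; the function name `trim_overflowing_caption`, the shortened abbreviation set, the sample captions, and the small `limit=60` are illustrative assumptions chosen for readability, not values from the library.

```python
import re

# Assumption: a shortened copy of the diff's abbreviation set, enough for the demo.
_ABBREV = frozenset({"vs", "e.g", "i.e", "cf", "fig", "al", "etc"})
# Same shape as the diff's regexes: a terminator followed by whitespace or
# end-of-string, and the "Figure N." / "Fig. N" / "Table N." label prefix.
_SENTENCE_END_RE = re.compile(r"[.!?][\"'\)\]]?(?=\s|$)")
_LABEL_RE = re.compile(
    r"(?:figure|fig\.?|table)\s+\d+(?:\.\d+)?\.?\s*", re.IGNORECASE
)


def trim_overflowing_caption(snippet: str, limit: int = 400) -> str:
    """Cut at the last real sentence terminator inside the cap window."""
    head = snippet[:limit]
    label = _LABEL_RE.match(head)
    label_end = label.end() if label else 0
    best = -1
    for m in _SENTENCE_END_RE.finditer(head):
        # Token that ends at this terminator, e.g. "vs." or "condition."
        word = re.search(r"([A-Za-z.]+)$", head[: m.start() + 1])
        tok = word.group(1).rstrip(".").lower() if word else ""
        if tok in _ABBREV:
            continue  # abbreviation period, not a sentence end
        if len(tok) == 1 and tok.isalpha():
            continue  # author initial such as "J."
        best = m.start() + 1  # keep through the terminator char
    if best > label_end:
        return snippet[:best].rstrip()
    # No usable terminator past the label: fall back to the mid-word cap.
    return snippet[:limit].rsplit(" ", 1)[0] + "…"


# Hypothetical over-absorbed caption: body prose ("H1 : ...") welded on.
absorbed = (
    "Figure 1. Mean ratings for gains vs. losses by condition. "
    "H1 : Participants in the reflection group will plan more"
)
trimmed = trim_overflowing_caption(absorbed, limit=60)

# Hypothetical caption with no sentence terminator inside the window.
unterminated = (
    "Figure 2. Scatterplot of planning quality against reflection "
    "time across all trials and"
)
fallback = trim_overflowing_caption(unterminated, limit=60)
```

In the first call the period after `vs` is skipped and the caption ends on `condition.`; in the second the only terminator belongs to the label itself, so the helper falls back to the ellipsis cap, matching the docstring's stated fallback.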

{docpluck-2.4.45 → docpluck-2.4.47}/docpluck/render.py RENAMED

@@ -772,6 +772,48 @@ def _fold_orphan_arabic_numerals_into_headings(text: str) -> str:
     return pattern.sub(repl, text)
+def _fold_orphan_multilevel_numerals_into_headings(text: str) -> str:
+    """Cycle G5c-1: fold an orphan multi-level section-number line (``5.4.``,
+    ``6.1.2.``) into the immediately following generic ``##``/``###`` heading.
+    Multi-level analogue of :func:`_fold_orphan_arabic_numerals_into_headings`.
+    pdftotext sometimes splits ``5.4. Discussion`` into a bare ``5.4.`` line
+    and a separate ``Discussion`` line; the section partitioner then promotes
+    the lone title word to a generic ``## Discussion`` and strands the number::
+        5.4.\\n\\n## Discussion  →  ### 5.4. Discussion
+    A multi-level dotted number alone on a line is itself a strong subsection
+    signal — body prose and list items do not emit a bare ``5.4.`` line — so
+    the fold is keyed purely on that structural signature plus blank-line-only
+    adjacency to a heading. The result is always ``### ``: multi-level
+    numbering denotes a subsection regardless of the level the partitioner
+    happened to give the stranded title (cf. ``_NUMBERED_SUBSECTION_HEADING_RE``,
+    which likewise emits ``### `` at any depth).
+    The fold target must be a *generic* heading. ``### Figure N`` / ``### Table N``
+    are library-emitted structural markers, and a heading already starting with
+    a number is a real numbered section — both are excluded (the latter also
+    keeps the pass idempotent). Only the immediately-adjacent case is folded;
+    an orphan number separated from its heading by a figure block or by body
+    prose (the title word consumed elsewhere) is partitioner-level work and is
+    left untouched here.
+    """
+    if not text:
+        return text
+    pattern = re.compile(
+        r"(?m)^(\d+(?:\.\d+){1,3})\.?[ \t]*\n(?:[ \t]*\n)+"
+        r"(?P<head>#{2,3} (?!\s*\d)(?!Figure\b)(?!Table\b)[^\n]+)"
+    )
+    def repl(m: re.Match) -> str:
+        num = m.group(1)
+        head_text = m.group("head").split(" ", 1)[1]
+        return f"### {num}. {head_text}"
+    return pattern.sub(repl, text)
 def _promote_study_subsection_headings(text: str) -> str:
     """Promote ``Study N Design and Findings`` etc. to ``### {title}``.
@@ -2159,6 +2201,10 @@ def render_pdf_to_markdown(
     # heading post-processors so it operates on the final heading shapes.
     md = _fold_orphan_roman_numerals_into_headings(md)
     md = _fold_orphan_arabic_numerals_into_headings(md)
+    # Cycle G5c-1: multi-level analogue — fold an orphan `N.N.` number line
+    # into the immediately following generic heading (`5.4.`\n\n`## Discussion`
+    # -> `### 5.4. Discussion`). Runs alongside the single-level folders.
+    md = _fold_orphan_multilevel_numerals_into_headings(md)
     # Cycle 11 (G5a): promote single-level `N. Title` lines to `## N. Title`,
     # gated on the document already numbering its sections. Runs AFTER the
     # orphan-numeral folders so `## 1. Introduction` exists as an anchor.
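
The fold's guards are easiest to see on small inputs. The following is a standalone sketch that copies the regex from the hunk above into a module-level constant; the wrapper name `fold_orphan_multilevel` and the three sample strings are assumptions made for illustration and do not come from the test suite.

```python
import re

# Pattern copied from the diff: an isolated multi-level number line, one or
# more blank lines, then a generic ##/### heading that is not already
# numbered and is not a Figure/Table marker.
_ORPHAN_MULTILEVEL_RE = re.compile(
    r"(?m)^(\d+(?:\.\d+){1,3})\.?[ \t]*\n(?:[ \t]*\n)+"
    r"(?P<head>#{2,3} (?!\s*\d)(?!Figure\b)(?!Table\b)[^\n]+)"
)


def fold_orphan_multilevel(text: str) -> str:
    """Fold `N.N.` + blank line(s) + generic heading into `### N.N. Title`."""
    def repl(m: re.Match) -> str:
        head_text = m.group("head").split(" ", 1)[1]  # drop the ##/### marker
        return f"### {m.group(1)}. {head_text}"

    return _ORPHAN_MULTILEVEL_RE.sub(repl, text)


# The one foldable shape: number, blank line, generic heading.
foldable = "5.4.\n\n## Discussion\n\nBody text."
# Not foldable: the nearest heading is a library-emitted figure marker.
figure_below = "5.3.\n\n### Figure 1\n\nCaption."
# Not foldable: body prose sits between the number and any heading.
prose_below = "6.3.\n\nParticipants completed the task.\n\n## Results"

folded = fold_orphan_multilevel(foldable)
```

Running the function a second time on `folded` leaves it unchanged, and an orphan number above an already-numbered heading is also left alone, which is the idempotency property the docstring attributes to the `(?!\s*\d)` lookahead.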

{docpluck-2.4.45 → docpluck-2.4.47}/docs/HANDOFF_2026-05-16_ai-gold-instructions.md RENAMED

@@ -30,12 +30,10 @@ docpluck must change three things:
   now rich enough for docpluck's TABLE verifier.
 - `register-view` and `migrate` now **reject a non-canonical key** with an
   actionable error.
-- `gold-generation.md` now enforces a **100%-accuracy, zero-hallucination policy**
-  and an **independent Codex / GPT-5.5 cross-model verification**: a second-vendor
-  model re-reads the PDF and audits every gold before it is stored. **Ensure the
-  `codex` CLI is installed and authenticated** in docpluck's environment
-  (`codex --version`; `codex login` if needed) — `generate-gold` blocks without it
-  rather than shipping unverified gold.
+- `gold-generation.md` enforces a **100%-accuracy, zero-hallucination policy**:
+  dual independent extraction + cross-check, verbatim anchors, schema validation.
+  (An independent cross-vendor verification pass is planned but not yet live — see
+  article-finder's `TODO.md`.)
 ---

docpluck-2.4.47/docs/HANDOFF_2026-05-16_iterate_run_4_final.md ADDED

@@ -0,0 +1,109 @@
+# Handoff — docpluck-iterate run 4 (fix-and-continue) — FINAL
+**Authored:** 2026-05-16, end of run 4. **For:** a fresh `/docpluck-iterate` session.
+This run executed `docs/HANDOFF_2026-05-16_iterate_run_4_fix_and_continue.md`'s three jobs.
+## State at handoff
+- Last shipped library version: **v2.4.45** (tag pushed, PyPI not published).
+- docpluckapp `service/requirements.txt` pin: auto-bumped to **v2.4.45** (commit `26bf88f9`).
+- Production `/_diag`: `docpluck_version = 2.4.45` — verified.
+- 26-paper baseline: **26/26 PASS** at v2.4.45.
+- Broad pytest: **0 failures** (the 15 pre-existing failures are resolved — see tests-regen).
+## What run 4 shipped
+| Item | Version | Outcome |
+|---|---|---|
+| JOB 1 — cycle 12 ligature rework | v2.4.44 | SHIPPED + prod-verified |
+| JOB 3 — tests-regen | (no bump) | SHIPPED (`c831e28`) |
+| JOB 3 — cycle 13 (G5b) | v2.4.45 | SHIPPED + prod-verified |
+| JOB 2 — 3 fragmented golds | (cache) | REGISTERED under canonical DOI keys |
+| skill — codex prerequisite note | (no bump) | committed `9aa4f5b` |
+### JOB 1 — cycle 12 ligature rework (v2.4.44)
+The session-3 cycle-12 attempt was broken (a duplicate `decompose_ligatures` call
+before the pre-existing S3 step starved S3's tracking; `test_report_tracks_changes`
+red). Reworked: removed the duplicate, unified S3 to call the single shared helper
+(explicit U+FB00-FB06 ASCII table — NFKC of `ﬅ` yields a non-ASCII long-s), kept the
+genuine `cell_cleaning` + render-post-process calls (table/caption/fence channels
+bypass `normalize_text`). Stale narrative in CHANGELOG/LEARNINGS/lessons/TRIAGE
+corrected. v2.4.43→v2.4.44 diff is ligature-only on korbmacher/jdm_m2/jdm16; 26/26.
+### JOB 3 — tests-regen (`c831e28`, no version bump)
+12 `test_extract_pdf_byte_identical` snapshots + 2 `test_sections_golden` goldens
+regenerated (environmental pdftotext line-wrap drift; `extract_pdf` is a pure
+pdftotext passthrough). The 15th failure, `test_request_09`, is **NOT** snapshot
+drift — it is a real **COL-class** column-interleave defect (the numbered RSOS
+bibliography renders as `References\n1. 2. 3. … 16.\n\nThaler RH…`, the number column
+split from the entry text). Left red; tracked as the COL class.
+### JOB 3 — cycle 13 (G5b, v2.4.45)
+`render.py`'s numbered-heading promoters carried a `max_lc_run >= 5` prose guard
+that demoted long descriptive headings. Reproduction showed real headings with
+lowercase-runs up to 12 — the count cannot separate heading from prose, so "raise
+5→8" would have been a partial fix. Removed the guard from
+`_promote_numbered_subsection_headings` (multi-level dotted numbering is itself the
+discriminator); kept it raised 5→8 in `_promote_numbered_section_headings`
+(single-level numbers collide with enumerated lists). jdm_.2023.16: 19 headings
+recovered; v2.4.44→v2.4.45 diff heading-promotion-only; 26/26.
+### JOB 2 — 3 fragmented golds (Chen / Xiao / Efendic)
+Regenerated all four views (`stats` / `reading` / `citations` / `intext_citations`)
+for each of the 3 papers through `gold-generation.md` — dual stats extraction +
+cross-check + reading/citations/intext carrier pass (12 subagents). Registered all
+12 views under canonical DOI keys (`10.1016__j.jesp.2021.104154`,
+`10.1080__23743603.2021.1878340`, `10.1177__19485506211056761`), producer
+`article-finder`; `ai-gold.py audit` clean.
+**Codex Step-4 cross-model verification was SKIPPED** (explicit user directive,
+2026-05-16) because the `codex` CLI has a Windows UTF-8 file-read bug — it misreads
+UTF-8 gold files as mojibake, flooding the verdict with false discrepancies. A full
+report is at `~/ArticleRepository/docs/handoffs/2026-05-16_codex-cli-windows-encoding-issue.md`
+(handed to the article-finder skill owner). The UTF-8-corrected Codex re-runs still
+found **genuine** gold discrepancies (chen 28 / xiao 22 / efendic 19 — real citation
+page-range swaps, missing title prefixes, a few wrong table cells) — the verdict
+files are saved at `~/ArticleRepository/tmp/goldgen_run4/{chen,xiao,efendic}_verdict2.txt`
+for article-finder's Step-4 fix-loop. The regenerated golds are dual-extracted +
+cross-checked and supersede the old docpluck private-prompt golds, but they have NOT
+passed Codex — a fix-loop is still owed (now article-finder's, per the directive).
+## Open queue — JOB 3 remaining APA defect cycles (the run did NOT finish JOB 3)
+**Standing verdict (rule 0e-bis): the APA corpus is NOT clean — ~11 papers still
+FAIL Phase-5d.** This run shipped cycle 13 and re-scoped G5c; the cycles below remain.
+Recommended order:
+1. **G5c-1** — render-layer fold of an orphan multi-level `N.N.` line into an
+   adjacent generic `##/###` heading (the `5.4.`/`## Discussion` case). C1-C2,
+   ships independently. See TRIAGE "Cycle 14 (investigation)".
+2. **FIG caption double-emission + truncation** — ~8 papers. S2, C2.
+3. **G5c-2 + G5d + TABLE** — the section-partitioner cluster, C3, a dedicated
+   session: G5c-2 (split-heading rejoin — pdftotext splits `N.N. Title` and the
+   partitioner consumes the title word; 5 of 6 jdm_m2 cases), G5d (named/unnumbered
+   heading demotion, ~7 papers), TABLE structure destruction (~11 papers, the single
+   largest blocker).
+4. **COL column-interleave** (incl. `test_request_09`'s numbered-bibliography split)
+   and **GLYPH** deleted-minus — S0, C3-C4, layout-channel; escalate.
+`test_request_09` will stay red until the COL class is fixed — it is a correct
+regression test catching a real defect, not a stale fixture.
+## Process notes / improvements
+- The `codex` CLI Windows UTF-8 bug (above) is article-finder's `gold-generation.md`
+  Step-4 to fix — report filed in the ArticleRepository handoffs dir.
+- `ai-full-doc-verify.md` Step 1c now states the codex prerequisite (committed).
+- Cross-skill lesson re-confirmed twice this run
+  (`reproduce-triage-defect-at-head-before-trusting-cost-estimate`): both G5b and
+  G5c were costed wrong in the TRIAGE — G5b deeper (guard removal, not 5→8), G5c
+  deeper (partitioner, not a render fold). Always reproduce + measure at HEAD.
+## Stop reason
+Run 4 completed JOB 1, JOB 2, and 2 of JOB 3's items (tests-regen + cycle 13), and
+re-scoped G5c. Stopped before the remaining JOB 3 cycles because they are a
+fresh-session-sized block of C2-C3 section-partitioner work (G5c-2 / G5d / TABLE)
+plus escalated C3-C4 layout-channel work (COL / GLYPH) — continuing to grind them in
+an already-very-long session is the wrong call. The next `/docpluck-iterate` session
+resumes at G5c-1 from the queue above.
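
The JOB 1 section of this handoff describes a single shared helper backed by an explicit U+FB00-FB06 ASCII table. A minimal sketch of what such a helper might look like follows. The name `decompose_ligatures` appears in the handoff, but its real body is not shown in this diff, so the implementation, the `st` expansions for U+FB05/U+FB06, and the returned replacement count are all assumptions made for illustration.

```python
# Explicit ASCII expansions for the Latin typographic ligatures
# U+FB00-U+FB06. Assumption: both "st" ligatures map to plain "st".
_LIGATURE_TABLE = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",
    "\ufb06": "st",
}
_LIGATURE_TRANS = str.maketrans(_LIGATURE_TABLE)


def decompose_ligatures(text):
    """Return (expanded_text, number_of_ligatures_replaced).

    Hypothetical signature: returning the count is one way a caller could
    keep reporting how many ligatures were expanded.
    """
    count = sum(text.count(ch) for ch in _LIGATURE_TABLE)
    return text.translate(_LIGATURE_TRANS), count


expanded, replaced = decompose_ligatures("signi\ufb01cant e\ufb00ect")
```

Keeping the mapping in one table means every channel (body, table cells, captions) would apply identical expansions, and a count returned from the single call site avoids the situation where an earlier pass consumes the ligatures before the tracked step sees them.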

docpluck-2.4.47/docs/HANDOFF_2026-05-16_iterate_run_4_fix_and_continue.md ADDED

@@ -0,0 +1,157 @@
+# Handoff — docpluck-iterate run 4: fix-and-continue (fresh session)
+**Authored:** 2026-05-16, end of session 3. **For:** a fresh `/docpluck-iterate` session.
+**Read this whole file before touching anything.** It is self-contained — it assumes no memory of session 3.
+You have **three jobs, in this order**:
+1. **JOB 1 — Resolve the in-flight cycle 12 (ligature fix).** It is committed-nowhere, sitting uncommitted in the working tree, and it is **broken** — it introduced a test regression and duplicates an existing normalize step. Decide: rework it, or revert it. **Do not commit it as-is.**
+2. **JOB 2 — Finish the article-finder AI-gold integration.** ArticleFinder shipped its side and left docpluck a punch-list (`docs/handoffs/` — see §JOB 2). docpluck's session-3 fix (commit `ac34c7e`) did part of it; the rest is open.
+3. **JOB 3 — Continue the APA iteration loop** from the TRIAGE punch-list.
+Invoke the `docpluck-iterate` skill normally (it runs its own preflight). Then work these three jobs.
+---
+## 0. Immediate git / working-tree state
+- Repo: `C:\Users\filin\Dropbox\Vibe\MetaScienceTools\docpluck`, branch `main`.
+- Last commit: `5cc321a docs: add AI-gold instructions from article-finder coordination`.
+- Recent history (all session 3, all clean, all prod-deployed): `bbad28f` v2.4.43 (cycle 11), `ac34c7e` skill-fix (gold delegation), `9b41e4d` v2.4.42 (cycle 10), `951b00a` v2.4.41-ish… — i.e. **v2.4.43 is the last shipped library version.**
+- **Uncommitted working tree = the cycle-12 ligature attempt (v2.4.44).** Modified: `docpluck/normalize.py`, `docpluck/render.py`, `docpluck/tables/cell_cleaning.py`, `docpluck/__init__.py`, `pyproject.toml`, `CHANGELOG.md`, `docs/TRIAGE_2026-05-14_phase_5d_gold_audit.md`, `.claude/skills/docpluck-iterate/LEARNINGS.md`, `.claude/skills/_project/lessons.md`; untracked new file `tests/test_ligature_decomposition_real_pdf.py`.
+- 26-paper baseline at the cycle-12 working tree: **26/26 PASS, 0 WARN** (re-confirmed).
+- Broad pytest at the cycle-12 working tree: **16 failed / 1233 passed**. 15 of the 16 are the long-standing pre-existing set (12× `test_extract_pdf_byte_identical` snapshot drift + 2× `test_sections_golden` + 1× `test_request_09`). **The 16th is cycle-12-introduced** — see JOB 1.
+---
+## JOB 1 — Cycle 12 (ligature decomposition) is BROKEN. Rework or revert.
+### What cycle 12 attempted
+Goal: decompose Latin typographic ligatures (`ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ`, U+FB00-FB06) — a corpus scan found them in 35 rendered `.md` files (`korbmacher` 82×, `jdm_.2023.16` 34×). The attempt added a new `normalize.py::decompose_ligatures` helper (per-char NFKC scoped to `[ﬀ-ﬆ]`) and wired it into three channels: `normalize_text` body (right after the NFC step, ~line 1567), `tables/cell_cleaning._html_escape`, and `render_pdf_to_markdown` post-process. Bumped to v2.4.44, NORMALIZATION_VERSION 1.9.8.
+### Why it is broken
+**docpluck `normalize.py` ALREADY HAS a ligature-expansion step** — `S3_ligature_expansion` at **`normalize.py` ~line 1687**:
+```python
+t = t.replace("ﬀ", "ff")   # ﬀ
+t = t.replace("ﬁ", "fi")   # ﬁ
+t = t.replace("ﬂ", "fl")   # ﬂ
+t = t.replace("ﬃ", "ffi")  # ﬃ
+t = t.replace("ﬄ", "ffl")  # ﬄ
+report._track("S3_ligature_expansion", before, t, "ligatures_expanded")
+```
+The cycle-12 `decompose_ligatures(t)` call was inserted EARLY in `normalize_text` (~line 1567, just after the NFC step) — it consumes every ligature **before** S3 runs. So S3 now finds nothing, tracks `ligatures_expanded = 0`, and `tests/test_normalization.py::TestFullPipeline::test_report_tracks_changes` fails:
+```
+raw = "signiﬁcant eﬀect −0.73"
+assert report.changes_made.get("ligatures_expanded", 0) > 0   # -> 0, FAIL
+```
+That is the 16th pytest failure. **Cycle 12 starved a pre-existing step.**
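The starvation can be reproduced in isolation. The sketch below is a minimal model of the ordering problem only: `Report`, `early_decompose`, `s3_expand`, and `run` are hypothetical stand-ins, not docpluck's actual code, and the real `report._track` signature is not reproduced.

```python
import unicodedata

# The five ligatures the existing S3 step handles (U+FB00-FB04).
S3_LIGATURES = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
}


class Report:
    """Hypothetical stand-in for the normalization report."""

    def __init__(self):
        self.changes_made = {}

    def track(self, key, before, after):
        # Record the step only when it actually changed the text.
        if before != after:
            self.changes_made[key] = self.changes_made.get(key, 0) + 1


def early_decompose(text):
    # Models the cycle-12 call: per-character NFKC scoped to U+FB00-FB06.
    return "".join(
        unicodedata.normalize("NFKC", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


def s3_expand(text, report):
    # Models the pre-existing tracked step.
    before = text
    for ligature, plain in S3_LIGATURES.items():
        text = text.replace(ligature, plain)
    report.track("ligatures_expanded", before, text)
    return text


def run(raw, with_early_call):
    report = Report()
    text = early_decompose(raw) if with_early_call else raw
    return s3_expand(text, report), report


raw = "signi\ufb01cant e\ufb00ect"
fixed_text, fixed_report = run(raw, with_early_call=False)
broken_text, broken_report = run(raw, with_early_call=True)
```

Both runs produce the same text, which is why the regression is invisible in rendered output: only the report counter differs, staying empty when the early call runs first.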
+### The real question to answer first
+If `S3_ligature_expansion` already expands ligatures in the normalize body channel, **why did 35 rendered papers still show raw `ﬁ`/`ﬂ` glyphs?** Cycle 12 was triggered by that observation. Possible causes — INVESTIGATE before reworking:
+- The rendered `.md` body may come from a path/level that does not run S3 (check what normalization level `render_pdf_to_markdown` applies to body text, and whether `preserve_math_glyphs` or a `NormalizationLevel` branch skips S3).
+- S3 only covers FB00-FB04 — it does **not** handle `ﬅ` (FB05) / `ﬆ` (FB06). Those would survive S3.
+- The ligatures in the render may come from channels that genuinely bypass `normalize_text` entirely: Camelot table cells (`cell_cleaning`), figure/table captions, `unstructured-table` fenced blocks, `raw_text` fallbacks.
+### Correct rework (recommended)
+1. **Remove** the early `decompose_ligatures(t)` call from `normalize_text` (~line 1567). The body channel already has S3 — do not starve it.
+2. If S3's FB00-FB04-only coverage is the gap, **extend the existing S3 step** to also map `ﬅ`/`ﬆ`→`st` (and keep its `report._track` call intact).
+3. The `cell_cleaning` + `render` post-process `decompose_ligatures` calls are probably the genuine fix (those channels bypass S3) — **verify** by rendering `korbmacher` / `jdm_.2023.16` with ONLY those two calls (no `normalize_text` call) and confirming 0 residual ligatures. Keep them if they close a real gap; the shared helper is fine to keep for those two channels.
+4. Re-confirm `test_report_tracks_changes` passes, run the 26-paper baseline, AI-verify (see JOB 2 — golds now come from article-finder).
+5. **Fix the stale narrative:** the uncommitted `CHANGELOG.md`, `LEARNINGS.md`, `_project/lessons.md`, and `TRIAGE` entries for cycle 12 currently claim a clean *new* fix. They are wrong — cycle 12 duplicated S3. Correct them to describe the actual rework before committing.
+**Alternative — clean revert:** `git checkout -- docpluck/ pyproject.toml CHANGELOG.md docs/TRIAGE_2026-05-14_phase_5d_gold_audit.md .claude/skills/docpluck-iterate/LEARNINGS.md .claude/skills/_project/lessons.md` and `rm tests/test_ligature_decomposition_real_pdf.py`, then redo ligature coverage as a fresh, correctly-scoped cycle once the S3-reach question above is answered. This is cleaner than fix-forward if the investigation shows the body path was fine all along.
+**Do NOT** ship v2.4.44 until `test_report_tracks_changes` passes and the diff is a genuine, non-duplicate fix.
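Rework steps 2 and 3 can be sketched together: one mapping that adds U+FB05/U+FB06 to the five pairs S3 already handles, plus a residual counter for the verification render. The names `LIGATURE_MAP`, `expand_ligatures`, and `count_residual_ligatures` are illustrative assumptions; in the real rework the mapping would live inside the existing S3 step so that its `report._track` call keeps firing.

```python
import re

# Existing S3 pairs plus the two S3 does not cover (U+FB05, U+FB06).
LIGATURE_MAP = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
    "\ufb05": "st", "\ufb06": "st",
}
LIGATURE_RE = re.compile("[\ufb00-\ufb06]")


def expand_ligatures(text):
    """Replace every Latin ligature in U+FB00-FB06 with plain letters."""
    return LIGATURE_RE.sub(lambda m: LIGATURE_MAP[m.group(0)], text)


def count_residual_ligatures(text):
    """Count ligature glyphs still present, e.g. in a rendered .md body."""
    return len(LIGATURE_RE.findall(text))
```

Running `count_residual_ligatures` over the rendered Markdown for `korbmacher` and `jdm_.2023.16`, with only the `cell_cleaning` and `render` calls active, gives the 0-residual check that step 3 asks for.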
+---
+## JOB 2 — Finish the article-finder AI-gold integration
+ArticleFinder's full instruction is at:
+`C:\Users\filin\Dropbox\Vibe\ArticleRepository\docs\handoffs\2026-05-16_docpluck_ai-gold-instructions.md`
+**Read it in full.** Summary of what docpluck still owes:
+**Already done in session 3 (commit `ac34c7e`):** removed docpluck's private gold-extraction prompt from `references/ai-full-doc-verify.md` (Step 1b); rewrote Step 1 to delegate generation to `article-finder generate-gold`; updated `SKILL.md`, `CLAUDE.md`, `docpluck-qa/SKILL.md`; saved memory `feedback_gold_generation_via_article_finder`.
+**Still open — audit `ac34c7e` against ArticleFinder's handoff and adjust:**
+1. **Canonical DOI keys are now HARD-ENFORCED.** `ai-gold.py register-view` / `migrate` now *reject* a bare local stem (`chen_2021_jesp`) with an error. Confirm the docpluck-iterate skill (`ai-full-doc-verify.md` "Choosing $KEY" + Phase 5d) resolves the paper's DOI via `ai-gold.py resolve` and passes the DOI — not a stem. If the autonomous loop ever keys by a stem it will now fail loudly.
+2. **`codex` CLI cross-model verification.** `gold-generation.md` now runs an independent Codex / GPT-5.5 model to audit every gold before storage; `generate-gold` blocks without it. `codex-cli 0.128.0` IS installed in this environment, but **verify it is authenticated**: `codex --version` works, which confirms only the install, so run `codex login` if calls fail. Note this dependency in the docpluck-iterate skill so a future run does not stall without an obvious cause.
+3. **Regenerate the stale `reading` golds.** docpluck's existing cached `reading` golds were produced by the old private prompt and diverge from `gold-generation.md`. Regenerate via `article-finder generate-gold <pdf>`. **Priority — the 3 fragmented papers first:**
+   | Paper | old stem key | canonical DOI key |
+   |---|---|---|
+   | Chen 2021 JESP | `chen_2021_jesp` | `10.1016__j.jesp.2021.104154` |
+   | Xiao 2021 CRSP | `xiao_2021_crsp` | `10.1080__23743603.2021.1878340` |
+   | Efendic 2022 SPPS | `efendic_2022_affect` | `10.1177__19485506211056761` |
+   Then regenerate the rest of docpluck's `reading` golds. After each, confirm with `ai-gold.py views <doi>`.
+4. Run `python ~/.claude/skills/article-finder/ai-gold.py audit` — expect **0 issues**. Coordinate the removal of the old short-stem cache directories with article-finder (do not delete cache data unilaterally — article-finder owns the cache repo's commits).
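The three canonical keys in the table share a visible pattern: the DOI with `/` replaced by `__`. A guard built on that pattern could catch a stem before it reaches `register-view`. This is a sketch under that assumption only. Both helper names are hypothetical, the key shape is inferred from the table rather than from article-finder's code, and `ai-gold.py resolve` remains the authoritative source of the key.

```python
import re

# Assumed shape of a canonical key: "10.<registrant>__<suffix>".
DOI_KEY_RE = re.compile(r"^10\.\d{4,9}__\S+$")


def doi_to_key(doi):
    """Hypothetical: turn a DOI into the key shape seen in the table."""
    return doi.strip().replace("/", "__")


def is_canonical_key(key):
    """True for DOI-shaped keys, False for bare local stems."""
    return bool(DOI_KEY_RE.match(key))
```

A check like `is_canonical_key(key)` before any cache call would turn a stem-keyed request into an early, clearly attributed failure inside docpluck rather than an error from article-finder.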
+**Important caveat for JOB 3:** cycles 8-12's Phase-5d verification consumed the *stale* `reading` golds (`tmp/*_gold.md`). The shipped fixes (v2.4.40-43) are still sound — they are keyed on structural signatures and gated by the 26-paper baseline — but once the golds are regenerated, **re-run the Phase-5d verifier for at least efendic / chen / xiao / jdm_m.2022.2 against the fresh golds** to confirm nothing was missed.
+---
+## JOB 3 — Continue the APA iteration loop
+Work queue: `docs/TRIAGE_2026-05-14_phase_5d_gold_audit.md`, section **"SESSION-3 STANDING VERDICT"**. The APA corpus is **NOT clean — ~12 papers still FAIL** Phase-5d on pre-existing defects. Per rule 0e-bis the run continues. Ranked next pickups:
+1. **G5b — long-descriptive-title prose guard** (S1, C1, cheap). `render.py::_promote_numbered_subsection_headings` and `_promote_numbered_section_headings` reject headings whose title has a run of ≥5 lowercase-initial words (`max_lc_run >= 5`). This over-rejects legitimate long numbered headings (`4. Knowledge acquisition, decision delay, and choice outcomes`; `2.4.2.2. Inference of planning strategies and strategy types`). For a *numbered* line that already passed the strict regex + (for section-level) the numbering-range/uniqueness/list-adjacency gates, the lc-run guard is near-redundant. Raise the threshold 5→8 in both promoters. ~25 headings, mostly `jdm_.2023.16`. This was the planned cycle 13.
+2. **G5c — split-line numbered headings** (S1, C2). `5.3.`\n\n`Results` — the number alone on a line, the title on the next; renders as an orphan bare-number line, and the content gets a MISLABELED generic `## Results` instead of `### 5.3. Results`. The cycle-3 orphan-arabic-numeral folder's multi-level analogue.
+3. **FIG caption double-emission + truncation** (S2, C2) — ~8 papers.
+4. **G5d — named (unnumbered) heading demotion** (S1, C2-C3) — ~7 papers; section-partitioner work, largest false-positive surface.
+5. **TABLE structure destruction** (S0/S1, C3) — ~11 papers, the single largest blocker; needs a render/structured-extraction coordination *design* — a dedicated session.
+6. **COL column-interleave** (S0, C3) and **GLYPH 011 deleted-minus / efendic `Mchange` no-CI** (S0, C3-C4) — escalate; need the layout channel.
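For G5b, the effect of the threshold change can be checked on the two example headings. The function below is a guess at how a lowercase-run metric might be computed (whitespace tokens, first character lowercase); the actual `max_lc_run` logic in `render.py` may tokenize differently, so treat the numbers as illustrative.

```python
def max_lc_run(title):
    """Longest run of consecutive words that start with a lowercase letter."""
    longest = current = 0
    for word in title.split():
        if word[0].islower():
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def rejected(title, threshold):
    """Models the prose guard: reject when the run reaches the threshold."""
    return max_lc_run(title) >= threshold


knowledge = "Knowledge acquisition, decision delay, and choice outcomes"
inference = "Inference of planning strategies and strategy types"
```

Under this definition both titles have a run of 6, so the guard rejects them at threshold 5 and accepts them at 8.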
+Also queued: a `tests:` regen cycle for the 15 pre-existing pytest failures (all snapshot drift — see `HANDOFF_2026-05-16_iterate_apa_run_3.md` §4). Triage each with a git-stash round-trip before regenerating.
+**Iteration discipline (unchanged, non-negotiable):** one defect class per cycle; every fix keyed on a structural signature (never paper identity); 26/26 baseline is the no-regression gate; AI-gold Phase-5d verify every affected paper (gold OBTAINED from article-finder, never self-generated — JOB 2); add a real-PDF regression test in the same cycle; ship incrementally (tagged release per cycle); never report "clean" while corpus FAILs remain (rule 0e-bis).
+---
+## Run context — what session 3 shipped (cycles 8-11, all clean)
+| Cycle | Version | Fix |
+|---|---|---|
+| 8 | v2.4.40 | standalone `2`-for-U+2212 minus recovery via point-estimate∈CI pairing (GLYPH, S0) |
+| 9 | v2.4.41 | numbered subsection-heading regex loosened (trailing dot + internal colon); ~78 `###` headings recovered (G5, S1) |
+| 10 | v2.4.42 | Elsevier page-1 footer (e-mail + ISSN lines) strip (D4, S2) |
+| 11 | v2.4.43 | single-level numbered section-heading promotion (G5a, S1) |
+| — | `ac34c7e` | process fix: gold generation delegated to article-finder (JOB 2 partial) |
+| 12 | (v2.4.44 attempt) | ligature decomposition — **BROKEN, see JOB 1** |
+Each of cycles 8-11: 26/26 baseline, 0 new pytest failures, AI-gold verifier OVERALL PASS, real-PDF test added, prod-deployed. Full detail: `docs/HANDOFF_2026-05-16_iterate_apa_run_3.md` and `.claude/skills/docpluck-iterate/LEARNINGS.md`.
+`run-meta` (`~/.claude/skills/_shared/run-meta/docpluck-iterate.json`) was left mid-run (verdict blank, `postflight_heartbeat:false`). The fresh session's preflight will re-init it; the session-3 postflight was **not** run (this handoff replaces it). If you want the session-3 signal preserved, note that `bugs_fixed`/`tests_added`/`lessons_appended` arrays in that file already hold cycles 8-12's entries.
+## Command cheat-sheet
+```bash
+# 26-paper baseline (the no-regression gate)
+PYTHONUNBUFFERED=1 python -u scripts/verify_corpus.py 2>&1 | awk '{print; fflush()}'
+# broad pytest (camelot off)
+DOCPLUCK_DISABLE_CAMELOT=1 python -u -m pytest tests/ -q --tb=line
+# render one paper
+python -c "from docpluck.render import render_pdf_to_markdown; from pathlib import Path; print(render_pdf_to_markdown(Path('<pdf>').read_bytes()))"
+# AI gold — OBTAIN from article-finder, never self-generate
+python ~/.claude/skills/article-finder/ai-gold.py resolve "<doi>"
+python ~/.claude/skills/article-finder/ai-gold.py check <doi-key> --view reading
+python ~/.claude/skills/article-finder/ai-gold.py get   <doi-key> --view reading
+#   on a miss: invoke the article-finder skill -> generate-gold <absolute-pdf-path>
+python ~/.claude/skills/article-finder/ai-gold.py audit      # expect 0 issues
+# prod health
+curl -s https://extraction-service-production-d0e5.up.railway.app/_diag | python -m json.tool
+```
+APA test PDFs: `../PDFextractor/test-pdfs/apa/`. Library version files to bump together: `docpluck/__init__.py`, `pyproject.toml`, `docpluck/normalize.py::NORMALIZATION_VERSION` (only if normalize.py changed).
