PyPI - docpluck - Versions diffs - 2.4.85__tar.gz → 2.4.90__tar.gz - Mend

docpluck 2.4.85.tar.gz → 2.4.90.tar.gz

Files changed (408)

{docpluck-2.4.85 → docpluck-2.4.90}/.claude/skills/_project/canary.json RENAMED

@@ -9,6 +9,7 @@
         "stem": "ip_feldman_2025_pspb",
         "key": "10.1177/01461672251327169",
         "doi": "10.1177/01461672251327169",
+        "expected_pdf_sha": "8680a4091648fcac0be6a9f80b2ace8faa5690bd96a84a527d6c5839d312cf58",
         "rationale": "Triggered the 2026-05-23 iterate-loop spine. Single paper exercising 4 distinct defect classes: G5d hallucinated subsection headings (Supplemental Materials mid-Method), B3 affiliation/running-header leak into body, B4 mid-text Table 3 caption, running-header 'Ip and Feldman' surfacing as a paragraph. The litmus test for whether the library's structural-defect cluster is closed.",
         "max_gold_age_days": 30
       },
@@ -16,6 +17,7 @@
         "stem": "plos_med_1",
         "key": "10.1371/journal.pmed.1004323",
         "doi": "10.1371/journal.pmed.1004323",
+        "expected_pdf_sha": "e6b4cea5767d1ec1eccdd8e448e6f9a6cab57ef740ae1f3fac6c59cddf6dbfd5",
         "rationale": "B1 text_loss canary — Tables 2-5; Table 5 has 13 SAE rows lost. The 1 still-failing Tier-D cell as of run 9 close. Forces the table-completeness story to stay covered every cycle. The architectural decision around modified-Approach-B bbox computation has been outstanding since 2026-05-22 — this canary keeps it visible.",
         "max_gold_age_days": 30
       },
@@ -23,6 +25,7 @@
         "stem": "chandrashekar_2023_mp",
         "key": "10.15626/MP.2022.3108",
         "doi": "10.15626/MP.2022.3108",
+        "expected_pdf_sha": "84ff52f5a22089ffd0bf7f37b47040e91bf7fafd7808265fb82866e419284a0e",
         "rationale": "B6 column-interleave canary — pdftotext two-column reading order serialises a multi-column page interleaved. Distinct defect class from ip_feldman / plos_med_1 (extraction-layer, not normalize-layer). Keeps the column-interleave architectural decision (R4 in 2026-05-22-residual handoff) from being silently forgotten.",
         "max_gold_age_days": 30
       }
@@ -32,12 +35,14 @@
         "stem": "chan_feldman_2025_cogemo",
         "key": "10.1080/02699931.2024.2434156",
         "doi": "10.1080/02699931.2024.2434156",
+        "expected_pdf_sha": "5661fe9039250dd2885a616253000662e1337b03eb6457ead286be30ffd7cf54",
         "exercises": "B2 hallucinated/demoted headings; B6 column interleave (Measures section)"
       },
       {
         "stem": "ar_apa_j_jesp_2009_12_011",
         "key": "10.1016/j.jesp.2009.12.011",
         "doi": "10.1016/j.jesp.2009.12.011",
+        "expected_pdf_sha": "69b2795c04ef1491b5cfc295be28197cc83b9579830f6fdd1ab8f1549314ce31",
         "exercises": "B2 hallucinated headings (Participants/Overview); B7 deleted-minus glyph (b = .022 → should be -0.022)"
       }
     ],
@@ -55,7 +60,7 @@
     "gold_view": "reading",
     "locator_via_article_finder": "python ~/.claude/skills/article-finder/cache-check.py <doi>",
     "gold_via_article_finder": "python ~/.claude/skills/article-finder/ai-gold.py get <key> --view reading",
-    "render_command": "python tools/render_for_audit.py --key {key} --out {out}",
+    "render_command": "py -3 tools/render_for_audit.py --key {key} --out {out}",
     "allowed_omissions": [
       "Front-matter masthead block between the title (H1) and the Abstract: author name line(s), affiliation line(s), corresponding-author contact line, journal-name banner, volume/issue/page range, 'Article type', publisher/copyright line, 'Article reuse guidelines', journal-home URL. docpluck strips this masthead by design (render.py _strip_frontmatter_masthead_block); consumers needing structured author metadata use CrossRef/DOI lookup. Its absence from the rendered .md is NOT TEXT-LOSS or METADATA-LEAK.",
       "Publication-history date lines such as 'Received <date>; revision accepted <date>' or 'Received/Revised/Accepted' mastheads — stripped as journal furniture (normalize.py _PAGE_FOOTER_LINE_PATTERNS).",

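The new `expected_pdf_sha` fields pin each canary paper to one specific PDF. A minimal sketch of how such a pin could be checked follows. It assumes the value is the lowercase hex SHA-256 of the raw PDF bytes (the 64-character length is consistent with that, but the diff does not state the algorithm). The helper names `pdf_sha256` and `matches_expected` are hypothetical and are not part of docpluck.

```python
import hashlib
from pathlib import Path


def pdf_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, entry: dict) -> bool:
    """Compare a local PDF against a canary entry's expected_pdf_sha.

    ASSUMPTION: expected_pdf_sha is a SHA-256 hex digest of the PDF bytes.
    Entries without the field are treated as a mismatch, not a pass.
    """
    expected = entry.get("expected_pdf_sha")
    if expected is None:
        return False
    return pdf_sha256(Path(path)) == expected.lower()
```

Treating a missing field as a mismatch is a deliberate choice in this sketch: an unpinned entry then cannot pass silently.
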
{docpluck-2.4.85 → docpluck-2.4.90}/.claude/skills/_project/lessons.md RENAMED

@@ -575,3 +575,13 @@ Rotation picks `pool[(N mod L) : (N mod L) + rotation_size]` wrapping. Over `cei
 **What:** `M_age 59.3` rendered as `39.3` on collabra.77859. The PDF VISUALLY shows `59.3`, but the embedded text codepoint is baked as `3` — **both pdftotext AND pdfplumber extract `39.3`**. This is the `Västfjäll→Vastfall` class, but a DIGIT in a statistic (silent stat corruption — the most dangerous form for meta-science).
 **How to diagnose:** 3-way check — pdftotext vs pdfplumber vs a visual/AI read of the PDF. When both deterministic extractors agree on the wrong value and only the visual disagrees, the codepoint is baked wrong and NO text-channel logic can fix it (recovery needs OCR / multimodal-glyph-consensus, a new subsystem). User decision 2026-06-08: **document as a known limitation, do not scope an OCR subsystem.** Consumer guidance: downstream stat-checkers (CitationGuard) must cross-verify digits against CrossRef/visual — docpluck cannot guarantee a digit matches the visual glyph when the publisher baked the wrong codepoint.
+## Dropped-glyph recovery splits into layout-recoverable vs pixel-only — probe per-instance before designing (2026-06-15, v2.4.89 W0h)
+**What:** A glyph that **pdftotext drops entirely** (emits nothing) is NOT one class but two, and a per-*instance* 3-way diff tells them apart:
+1. **Layout-recoverable** — pdfplumber still sees the glyph (as an unmapped `(cid:N)` in a symbol font); the text channel lost it, the layout channel kept it. → Recover from the layout channel. `ar_apa_j_jesp_2009_12_011`: 3 of 4 negative betas (`-.022 / -.88 / -.428`) carry their U+2212 as `(cid:2)` in font `AdvP4C4E74`. `normalize.recover_dropped_minus_via_layout` (W0h) re-inserts the minus in the `<stat> = <minus><coef>` operator slot.
+2. **Pixel-only** — pdfplumber ALSO drops it (absent from chars/lines/rects/curves AND pdfminer's raw LTChar/LTImage layer); the minus is painted pixels only — same OCR floor as the baked-glyph lesson above. `ar_apa` `β = -.245`: unrecoverable in text+layout. **User decision 2026-06-15: ship the layout-recoverable subset, document the pixel-only residual, do NOT build an OCR tier.**
+**Trap:** "the layout channel recovers what pdftotext drops" is true only for sub-case 1. Probe the SPECIFIC failing instances (geometry: is there a char/line/rect immediately left of the number?) before assuming a layout fix is complete — feasibility here was 3/4, and a recovery that silently fixes 3 of 4 sign-flips is a product call (false-confidence), surfaced to the user.
+**Plumbing gotcha (cost a detour):** the section/render path calls `normalize_text` WITHOUT `layout=` (`sections/__init__.py`), so F0 and every layout-gated pass is OFF there by design (text-channel-only contract). A layout-aware fix must thread a **dedicated** param (`dropped_minus_layout`, `render → extract_sections → normalize_text`) — reusing the `layout=` gate would also switch F0 on and risk broad regressions. The detector must cluster chars into lines by **y-overlap**, never `round(top)` (a minus sits ~0.4pt off its digits' baseline and rounds into a different bucket, orphaning it).
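
The y-overlap rule in the plumbing gotcha above can be sketched as follows. This is an illustration only, not the code in `extract_layout._chars_to_spans`. It assumes each char is a dict with `top`, `bottom`, `x0` and `text` keys (the shape pdfplumber uses for its char objects). The function name and the `eps` default are hypothetical.

```python
def cluster_lines_by_y_overlap(chars, eps=0.5):
    """Group chars into text lines by vertical overlap, not round(top).

    A char joins the first line whose seed band it overlaps:
        char.bottom > line.top + eps  and  char.top < line.bottom - eps
    The band is fixed by the line's first char (it is not grown), which
    keeps tightly spaced lines from chaining together.
    """
    lines = []  # each entry: {"top": float, "bottom": float, "chars": list}
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        for line in lines:
            if ch["bottom"] > line["top"] + eps and ch["top"] < line["bottom"] - eps:
                line["chars"].append(ch)
                break
        else:
            # No existing line overlaps this char: it seeds a new line.
            lines.append({"top": ch["top"], "bottom": ch["bottom"], "chars": [ch]})
    # Emit each line's text in left-to-right order.
    return [
        "".join(c["text"] for c in sorted(line["chars"], key=lambda c: c["x0"]))
        for line in lines
    ]
```

With a minus at `top=352.7` and its digits at `top=352.3`, `round(top)` yields buckets 353 and 352, while the overlap test keeps all four glyphs on one line.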

{docpluck-2.4.85 → docpluck-2.4.90}/.claude/skills/docpluck-deploy/SKILL.md RENAMED

@@ -108,7 +108,7 @@ fi
 echo "✅ Library version sync OK"
 ```
-### 4. Verify Vercel Environment Variables
+### 5. Verify Vercel Environment Variables
 ```bash
 cd frontend && vercel env ls
 ```
@@ -125,7 +125,7 @@ Required variables (all must show as "Encrypted"):
 If any are missing, refer to SETUP_GUIDE.md.
-### 5. SES Environment Variables Present in Vercel Production (CRITICAL)
+### 6. SES Environment Variables Present in Vercel Production (CRITICAL)
 Same pattern as check 4 (Vercel env vars) but for the SES + notification surface. Missing any of these means the app boots in production with a broken email path (queued notifications, no send; webhook signature checks fail; admin alerts silently dropped).
@@ -141,7 +141,7 @@ done
 **Gate:** all listed vars present (Encrypted). Any missing = FAIL the deploy.
-### 6. SES Identity SUCCESS
+### 7. SES Identity SUCCESS
 ```bash
 aws sesv2 get-email-identity --email-identity mail.docpluck.app --region eu-west-1 \
@@ -150,7 +150,7 @@ aws sesv2 get-email-identity --email-identity mail.docpluck.app --region eu-west
 **Gate:** must equal `SUCCESS`. Anything else = FAIL the deploy.
-### 7. DKIM + SPF + DMARC DNS Resolve
+### 8. DKIM + SPF + DMARC DNS Resolve
 ```bash
 # DKIM token list — read live from SES, do NOT hard-code:
@@ -322,8 +322,14 @@ curl -s -w '\nHTTP %{http_code}\n' \
 If deployment fails:
 ```bash
-# Vercel: rollback to previous deployment
-cd frontend && vercel rollback
+# Vercel: roll back to a SPECIFIC known-good deployment (don't rely on bare
+# `vercel rollback`, which targets whatever happens to be the previous deploy
+# and is ambiguous when several deploys landed close together).
+cd frontend
+vercel ls                          # find the last URL that showed Ready + healthy
+vercel rollback <deployment-url>   # e.g. https://pdfextractor-abc123.vercel.app
+# (bare `vercel rollback` = roll back to the immediately previous deployment;
+#  only use it when you are certain that is the known-good target.)
 # Railway: redeploy from last working commit
 railway service extraction-service

{docpluck-2.4.85 → docpluck-2.4.90}/.claude/skills/docpluck-iterate/LEARNINGS.md RENAMED

@@ -1174,3 +1174,27 @@ the `SKIP_CANARY=1` justification when the `claude`-less pre-commit hook hard-fa
 (pdfplumber/camelot/pytest all had to be pip-installed mid-run into C:\Python314).
 Stand up a proper env before an iterate run so the harness + camelot table tests
 aren't skipped.
+---
+## Run: 2026-06-15 · cycle 1 · B7 layout-channel dropped-minus recovery · v2.4.88 → v2.4.89
+### Outcome
+- **B7 (the dropped-minus-with-no-CI residual) FIXED** for the layout-visible subset. `normalize.recover_dropped_minus_via_layout` (W0h) recovers `ar_apa` `β = -.022 / -.88 / -.428`; leaves `.48` positive. Shipped v2.4.89. Surgical blast radius (only ar_apa flips across the 5 onboarded canary papers).
+- User-approved scope decision (surfaced mid-cycle): ship the layout-visible subset + document the OCR-only residual; do NOT build an OCR tier this run.
+### Blind Spots
+- **"The layout channel can recover what pdftotext drops" has a HARD FLOOR: pixel-only glyphs.** `ar_apa` `β = -.245` is drawn as painted pixels — its minus is absent from pdftotext AND pdfplumber chars/lines/rects/curves AND pdfminer's raw LTChar/LTImage layer. Before designing ANY layout-recovery, probe the SPECIFIC glyph instances (not just "does the layout see a minus somewhere") — feasibility can be 3/4, not 4/4, and a recovery that silently fixes 3 of 4 sign-flips is a product decision (false-confidence risk), not an automatic win. Confirm per-instance with a geometry probe.
+- **The section/render path calls `normalize_text` WITHOUT `layout=`** (`sections/__init__.py:75`), so F0 — and any layout-gated pass — is OFF in the main render path by design (the section pipeline is deliberately text-channel-only). A layout-aware fix CANNOT just be added behind the existing `layout` gate; it must thread a DEDICATED param (`dropped_minus_layout`) so it doesn't also switch F0 on (which would risk broad regressions). Non-obvious; cost a detour to discover.
+- **A TRIAGE data-gap claim can be STALE.** `TRIAGE_2026-06-15` said `plos_med_1` has "no reading gold / BLOCKED". `ai-gold.py get` returned a 42KB gold (28d old). Always re-check `ai-gold.py get <key>` before recording BLOCKED-NEEDS-GOLD — the gold may exist. (Also: the gate's I11 then flagged a key-form mismatch warning for plos — the gold may be under a non-canonical key dir; worth a rekey check.)
+### Edge Cases
+- **Group layout chars into lines by y-OVERLAP, never `round(top)`.** A minus glyph sits ~0.4pt off its digits' baseline; `round(top)` put `.88`'s `(cid:2)` in top-bucket 353 while its digits were in 352, orphaning it (the first impl recovered `.022` but missed `.88`). Cluster by `bottom > top+ε and top < bottom-ε` (y-overlap) like `extract_layout._chars_to_spans` does.
+- **pdfplumber reports the glyph EM-box, not ink-box**, so bbox HEIGHT is identical for minus/hyphen/period (≈ font ascent-descent) — you cannot shape-discriminate a minus by vertical ink position. Width does differ (minus 6.3 vs hyphen 3.3 vs period 2.0 at size 8). So the safe detector keys on CONTEXT (unmapped `(cid:N)` in the `= <coef>` operator slot + text-lacks-minus), not glyph shape.
+### Verification Gaps
+- **Camelot non-determinism on Windows changes the rendered byte count run-to-run** (ar_apa 27038B vs 27507B across two identical renders — a table present in one, absent in the other). This makes `rendered_sha` unstable for I10 and can flip a paper's table findings. The body-prose betas are stable, but table-bearing verification needs a deterministic Camelot or a retry/normalize step before sha-pinning.
+- **The full `pytest -n auto` (xdist) background run died silently on Windows** (0 output, no python process, no completion notification) — the exact "backgrounded long task dies" failure the portfolio rule warns about. Serial `pytest -q` with an explicit `PYTEST_DONE_EXIT_$?` marker is the reliable pattern here; verify liveness (process + output growth), never infer it.
+### Process notes
+- One AI-verify cycle surfaced THREE pre-existing defect classes beyond the target: RC-1 two-column interleave (ip_feldman/chandrashekar/chan_feldman/ar_apa-table), B1 table-completeness (plos Tables 2/3/4/5 lose rows/cols/bodies), and metadata-leak (plos affiliations/abbrev/running-headers — an RC-2 residual). Per 0e-bis the run's standing verdict stays FAIL; cycle 1 is an incremental ship, not a clean PASS. Surfaced to user as the run punch-list.
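
The Edge Cases above conclude that a safe detector keys on context rather than glyph shape: an unmapped `(cid:N)` in the `= <coef>` operator slot, plus a text channel that lacks the minus. A rough sketch of that idea follows. It assumes the layout channel and the text channel are available as plain strings for the same line. The function name and the regex are hypothetical and are not the implementation of `normalize.recover_dropped_minus_via_layout`.

```python
import re

# Hypothetical pattern for "<stat> = (cid:N)<coef>", e.g. "b = (cid:2).022".
_CID_BEFORE_COEF = re.compile(r"=\s*\(cid:\d+\)\s*(\d*\.\d+)")


def find_dropped_minus_candidates(layout_line: str, text_line: str) -> list[str]:
    """Return coefficients whose sign was probably dropped by the text channel.

    A coefficient qualifies only when BOTH hold:
      1. the layout channel shows an unmapped (cid:N) glyph between '=' and it;
      2. the text channel shows the same coefficient directly after '=',
         i.e. with no minus sign in between.
    """
    candidates = []
    for match in _CID_BEFORE_COEF.finditer(layout_line):
        coef = match.group(1)
        unsigned_in_text = re.search(r"=\s*" + re.escape(coef), text_line)
        if unsigned_in_text:
            candidates.append(coef)
    return candidates
```

A pixel-only minus such as the `β = -.245` case never appears as a `(cid:N)` in the layout channel, so a detector of this shape would simply return nothing for it.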

{docpluck-2.4.85 → docpluck-2.4.90}/.claude/skills/docpluck-iterate/SKILL.md RENAMED

@@ -34,7 +34,7 @@ Reference handoff: `docs/superpowers/handoffs/2026-05-25-canary-audit-architectu
 1. Run: `bash ~/.claude/skills/_shared/bin/preflight-filter.sh docpluck-iterate` and print its `🔧 skill-optimize pre-check · ...` heartbeat as your first visible output line.
 2. Initialize `~/.claude/skills/_shared/run-meta/docpluck-iterate.json` per `~/.claude/skills/_shared/preflight.md` step 6 — including the **iterate-skill extension fields** (`project_root`, `current_cycle`, `cycle_status`, `cycle_targets`, `phase_5d_runs`, `corpus_sweeps`, `open_findings`, `iterate_skips`, `run_closeout`). Set `project_root` to the absolute path of the docpluck repo so `iterate-gate.sh` can find the canary file when running outside a git CWD.
 3. Load `~/.claude/skills/_shared/quality-loop/core.md` into working memory. Although `docpluck-iterate` is not a `-qa`/`-review`/`-cleanup`/`-deploy` skill itself, it ORCHESTRATES those skills cycle-by-cycle and the spine rules R1–R5 apply transitively. Treat any FAIL from those orchestrated skills as a phase failure for this skill.
-4. **Load the iterate-loop spine** — read `~/.claude/skills/_shared/iterate-loop/core.md` and hold rules I1–I7 in working memory. Confirm `<docpluck-repo>/.claude/skills/_project/canary.json` exists (it MUST; missing canary = immediate gate fail). This spine — not prose in this file — is what makes the cycle/run discipline foolproof. Rules I1 / I2 / I3 / I5 / I7 are checked after every cycle by `iterate-gate.sh --cycle N`; rules I4 / I6 are checked at run-close by `iterate-gate.sh --close`. A non-zero exit BLOCKS the corresponding write (cycle PASS, run closeout). See "Iterate-loop spine integration" below for the exact call sequence.
+4. **Load the iterate-loop spine** — read `~/.claude/skills/_shared/iterate-loop/core.md` and hold rules **I1–I12** in working memory (the spine grew past I7 — I8 gate-was-called, I9 locator-via-article-finder, I10 artifact-existence, I11 gold-sha-matches-cache, I12 lesson-readback; see the table below for the authoritative `--cycle`/`--close` split, mirrored from `core.md`). Confirm `<docpluck-repo>/.claude/skills/_project/canary.json` exists (it MUST; missing canary = immediate gate fail). This spine — not prose in this file — is what makes the cycle/run discipline foolproof. The **per-cycle** rules (I1, I2, I3, I4, I5, I7, I9, I10, I11, I12-present) are checked after every cycle by `iterate-gate.sh --cycle N`; the **run-close** rules (I6, I8, I12-relevance) are checked by `iterate-gate.sh --close`. A non-zero exit BLOCKS the corresponding write (cycle PASS, run closeout). See "Iterate-loop spine integration" below for the exact call sequence.
 If you skip these steps, the postflight heartbeat will be missing and the run will produce no learning signal — defeating the whole point of a self-improving loop.
@@ -46,15 +46,20 @@ This skill — like every `<prefix>-iterate` and `<prefix>-fix` skill across the
 **What the spine enforces (full text in `~/.claude/skills/_shared/iterate-loop/core.md`):**
-| Rule | Hard check |
-|---|---|
-| **I1** phase-5d-actually-ran | Cycle must record ≥1 `phase_5d_runs` entry |
-| **I2** canary-coverage | Cycle must AI-verify every (target ∪ canary) paper |
-| **I3** verdict-on-truth-not-proxy | Cycle PASS requires every `phase_5d_runs` verdict == PASS |
-| **I4** blocked-gold-is-a-cycle-status | BLOCKED-NEEDS-GOLD is a legal cycle status; blocks run-close (not cycle) |
-| **I5** corpus-sweep-not-stale | Sweep must have run within last 3 cycles |
-| **I6** no-open-canary-findings-at-close | Run-close requires zero open canary findings + zero BLOCKED canary papers |
-| **I7** deterministic-metric-is-not-a-verdict | Idempotency / char-ratio / snapshot diff are INPUTS to I3, not substitutes for it |
+| Rule | When | Hard check |
+|---|---|---|
+| **I1** phase-5d-actually-ran | `--cycle` | Cycle must record ≥1 `phase_5d_runs` entry |
+| **I2** canary-coverage | `--cycle` | Cycle must AI-verify every (target ∪ canary) paper |
+| **I3** verdict-on-truth-not-proxy | `--cycle` | Fires on ANY `phase_5d_runs` verdict ∈ {FAIL, STALE_GOLD} — does not wait for a claimed PASS |
+| **I4** blocked-gold-is-a-cycle-status | `--cycle` | BLOCKED-NEEDS-GOLD is a legal cycle status (listed as a warning); blocks run-close (not cycle) |
+| **I5** corpus-sweep-not-stale | `--cycle` | Sweep must have run within last 3 cycles |
+| **I6** no-open-canary-findings-at-close | `--close` | Run-close requires zero open canary findings + zero BLOCKED canary papers (MUST-with-override) |
+| **I7** deterministic-metric-is-not-a-verdict | `--cycle` | Idempotency / char-ratio / snapshot diff are INPUTS to I3, not substitutes for it |
+| **I8** gate-was-actually-called | `--close` | Every cycle 1..current must have invoked `iterate-gate.sh --cycle N` (no skipped gate) |
+| **I9** locator-via-article-finder | `--cycle` | All paper/gold location + retrieval goes through article-finder (`locator_via`/`gt_via` ∈ cache-check/corpus-query/ai-gold.get/ai-gold.check/generate-gold) — never direct `find`/`glob`/path reads of `test-pdfs/` or `ai_gold/` |
+| **I10** artifact-existence | `--cycle` | Each `phase_5d_runs` entry's rendered file must exist and its sha match `rendered_sha` |
+| **I11** gold-sha-matches-cache | `--cycle` | Each entry's `gold_sha` must match the sha of the cached gold it was verified against |
+| **I12** lesson-readback | `--cycle` / `--close` | `--cycle`: a this-run `lesson_readback` trail must exist (MUST). `--close`: each surfaced card colliding with `files_touched`/`commands_run` must be in `lessons_applied` (MUST-with-override) |
 **The two mandatory gate calls (cannot be skipped):**
@@ -182,94 +187,7 @@ Before any iteration, establish:
 **Per-cycle self-check (Phase 9 / Verification Checklist):** every cycle must be able to answer "what did I parallelize via subagents this cycle, and what did I do inline that could have been parallel?" If the honest answer is "I did N independent things serially," that is a logged process miss — fix it next cycle.
-Iteration is dominated by I/O + AI work that is naturally parallel across papers. The orchestrator (this skill) MUST aggressively fan out to `Agent` subagents whenever there are 2+ independent units of work, when the work passes the safety checklist below.
-### Safety checklist — must pass ALL 4 before parallelizing
-1. **No shared file state.** Each parallel unit must write to a distinct output path (e.g., `tmp/<paper>_gold.md` for paper A vs `tmp/<paper>_gold.md` for paper B — different paths). Never have two agents writing the same file.
-2. **No shared git state.** Never run two parallel agents that modify git (commits, tags, branches, pushes). Git operations are sequential.
-3. **No sequential dependency.** Agent B does not consume an artifact agent A produces in the same fan-out batch. If A→B, run sequentially.
-4. **Self-contained briefs.** Each subagent prompt is a complete, standalone instruction set — absolute paths, no references to "the prior conversation", no implicit context.
-If ANY checklist item fails, run sequentially.
-### Where to parallelize (and where to use `run_in_background: true`)
-| Phase / step | Parallel? | How | Background? |
-|---|---|---|---|
-| **Phase 2 broad-read** — render 8-10 sample papers from publishers | YES | One `Bash` subprocess per paper OR one `Agent` per paper-cluster | Foreground for ≤4; background for more |
-| **Phase 5d gold-extraction** — DELEGATED to `article-finder generate-gold`; docpluck-iterate does NOT generate golds | N/A | `article-finder` owns extraction + its own parallelization | N/A |
-| **Phase 5d verification** — compare rendered.md ↔ gold for each affected paper | YES | One `Agent` per paper (independent inputs) | **Background** (1-2 min each) |
-| **Phase 5d cross-paper sweep** — corpus-level pattern detection on 5 papers | YES, but only ONE agent for the whole sweep | Single agent reading 5 paper pairs and emitting a corpus-level findings list | Foreground (one call, ~3-4 min) |
-| **Diagnostic artifact capture** — `pdftotext` + `extract_pdf_structured` per paper | YES | One `Bash` per paper | Foreground (each <5s) |
-| **Phase 5b broad pytest** — independent of Phase 5d verification | YES | `Bash` with `run_in_background: true` | Background; check via `Monitor` |
-| **Phase 5c 26-paper baseline** — independent of Phase 5d | YES | `Bash` with `run_in_background: true` | Background; ~10 min |
-| **Phase 6c rendered ↔ tables-tab parity** — across affected papers | YES | One `Bash` per paper | Foreground; each <5s |
-| **Phase 8 Tier-3 prod parity** — across affected papers (POST-deploy) | YES | One `Bash` curl per paper | Foreground; each <10s |
-| **Phase 7 release** — version bump + commit + tag + push + auto-bump merge | **NO** | Sequential git operations | N/A |
-| **Phase 4 library fix** — code edits | **NO** | Orchestrator holds architectural context | N/A |
-| **/docpluck-cleanup, /docpluck-review, /docpluck-deploy** — meta-skill chain | **NO** to running 2 at once | Each is sequential per its own internal logic | Foreground; chain them |
-### Concrete fan-out patterns to use
-**Pattern A — obtain golds, then fan-out VERIFICATION for affected papers (typical Phase 5d):**
-```
-1. Identify N affected papers for this cycle.
-2. For each paper: resolve the canonical key and `ai-gold.py check` the shared cache.
-   On a miss, gold generation is DELEGATED to `article-finder generate-gold` — docpluck
-   NEVER dispatches its own gold-extraction subagent (see Phase 5d Step 1; 2026-05-16
-   directive). Copy each `reading` view to `tmp/<paper>_gold.md`.
-3. While golds are obtained, render the affected papers at the working-tree version
-   via a single `Bash` script that renders them in sequence (Camelot is not thread-safe;
-   keep render serial).
-4. As each gold is ready, optionally dispatch its verifier Agent immediately (background).
-   Or wait for all golds, then dispatch all verifiers in a single multi-tool-call message.
-5. Aggregate verdicts as they return; queue defects per rule 0e.
-```
-**Pattern B — background long-running tasks during planning:**
-```
-1. Kick off the 26-paper baseline (`Bash` with run_in_background=true).
-2. Kick off the broad pytest (`Bash` with run_in_background=true).
-3. While both run, do Phase 3 TRIAGE re-read + Phase 4 code edit planning.
-4. By the time you need 5b/5c results, they're already done.
-```
-**Pattern C — corpus sweep with a single agent:**
-```
-For the every-3rd-cycle corpus sweep, do NOT fan out 5 separate agents.
-Use ONE agent given paths to 5 paper pairs and ask for a corpus-level
-findings list. This produces a coherent ranking; 5 independent agents
-would each produce a local list and the orchestrator would have to
-merge them by hand.
-```
-### Anti-patterns to avoid
-| Anti-pattern | Why it's wrong |
-|---|---|
-| "Dispatch 10 agents to each fix a different defect" | Multiple agents editing the same library code → merge conflicts, lost work. Orchestrator does fixes. |
-| "Dispatch parallel agents to bump version" | Two agents racing on `pyproject.toml` / `__init__.py` / git → broken commits. |
-| "Dispatch parallel agents to render the same paper" | Same `tmp/<paper>_v<version>.md` written twice → race condition. |
-| "Skip the Pattern-A wait and do verify before gold exists" | Verifier needs gold as input; sequential dependency. |
-| "Dispatch a subagent to read the PDF and produce a gold" | docpluck-iterate does NOT generate ground truth — generation is delegated to `article-finder generate-gold` (one producer, one protocol; 2026-05-16 directive). A local extraction subagent re-forks the gold. |
-| "Use multiple agents for a single small task" | Subagent dispatch has fixed overhead (~30s). For tasks <60s, do it inline. |
-| "Subagents share my conversation context" | They DON'T. Each subagent prompt must be self-contained — give absolute paths, restate the goal, restate the discipline. |
-### When in doubt
-Default to **sequential** when:
-- You're not sure if outputs collide.
-- The task takes <1 minute (overhead > benefit).
-- The task modifies global state (git, env, settings).
-Default to **parallel** when:
-- 3+ independent items with the same pattern (papers, sections, checks).
-- Each item takes ≥2 minutes.
-- Each item has a distinct output path.
+**Operational detail — load on demand before fanning out:** [references/subagent-parallelization.md](references/subagent-parallelization.md) holds the 4-item safety checklist (all must pass or run sequentially), the per-phase where-to-parallelize + `run_in_background` table, the concrete fan-out patterns (A: golds-then-verify, B: background-during-planning, C: single-agent corpus sweep), the anti-patterns table, and the when-in-doubt sequential/parallel defaults. Read it whenever you are about to dispatch parallel work.
 ---
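
The 4-item safety checklist moved out of SKILL.md in this hunk can be read as a pre-dispatch predicate. The sketch below is one possible encoding. The `WorkUnit` type, its fields, and `can_fan_out` are hypothetical names introduced for illustration; the skill itself states the checklist in prose only.

```python
from dataclasses import dataclass, field


@dataclass
class WorkUnit:
    """One candidate unit of parallel work (hypothetical shape)."""
    name: str
    output_path: str
    touches_git: bool = False
    consumes: set[str] = field(default_factory=set)  # output paths it needs
    self_contained_brief: bool = True


def can_fan_out(units: list[WorkUnit]) -> tuple[bool, list[str]]:
    """Apply the four checklist items; any failure means run sequentially."""
    reasons = []
    paths = [u.output_path for u in units]
    # 1. No shared file state: every unit writes a distinct path.
    if len(set(paths)) != len(paths):
        reasons.append("shared file state")
    # 2. No shared git state: never two parallel units that modify git.
    if sum(u.touches_git for u in units) > 1:
        reasons.append("shared git state")
    # 3. No sequential dependency: no unit consumes a batch-mate's output.
    produced = set(paths)
    if any(u.consumes & produced for u in units):
        reasons.append("sequential dependency")
    # 4. Self-contained briefs for every unit.
    if not all(u.self_contained_brief for u in units):
        reasons.append("brief not self-contained")
    return (not reasons, reasons)
```

Returning the list of failed items, rather than a bare boolean, makes it easy to record why a batch was run sequentially.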

docpluck-2.4.90/.claude/skills/docpluck-iterate/references/subagent-parallelization.md ADDED

@@ -0,0 +1,96 @@
+# Subagent parallelization — operational detail
+> Extracted from `SKILL.md` (the "Subagent parallelization" MANDATE section) on 2026-06-15 to keep the SKILL under the size guideline. The MANDATE itself, the per-cycle self-check, and the provenance stay inline in SKILL.md; this file holds the safety checklist, the where-to-parallelize table, the concrete fan-out patterns, the anti-patterns, and the when-in-doubt defaults. Load on demand whenever you are about to fan out work.
+> **Added 2026-05-14 by user directive, RE-STATED 2026-05-15** ("use subagents to optimize the whole process whenever possible"). The re-statement means the directive slipped — treat this as a HARD MANDATE, not advice.
+**MANDATE (restated):** Before doing ANY batch of 2+ independent units of work inline, STOP and ask: *"could N parallel background subagents do this instead?"* If yes and the safety checklist passes, you MUST fan out. Doing serially in the orchestrator's own context what subagents could do in parallel is a process defect — it is slow and burns the orchestrator's context window. This applies to: per-paper renders, gold extractions, AI-gold verifications, broad-read reader-passes, diagnostic captures, cross-paper sweeps. The orchestrator keeps ONLY: code edits, git/release operations, version bumps, and <60s one-offs.
+Iteration is dominated by I/O + AI work that is naturally parallel across papers. The orchestrator (the `docpluck-iterate` skill) MUST aggressively fan out to `Agent` subagents whenever there are 2+ independent units of work, when the work passes the safety checklist below.
+## Safety checklist — must pass ALL 4 before parallelizing
+1. **No shared file state.** Each parallel unit must write to a distinct output path (e.g., `tmp/<paper>_gold.md` for paper A vs `tmp/<paper>_gold.md` for paper B — different paths). Never have two agents writing the same file.
+2. **No shared git state.** Never run two parallel agents that modify git (commits, tags, branches, pushes). Git operations are sequential.
+3. **No sequential dependency.** Agent B does not consume an artifact agent A produces in the same fan-out batch. If A→B, run sequentially.
+4. **Self-contained briefs.** Each subagent prompt is a complete, standalone instruction set — absolute paths, no references to "the prior conversation", no implicit context.
+If ANY checklist item fails, run sequentially.
+## Where to parallelize (and where to use `run_in_background: true`)
+| Phase / step | Parallel? | How | Background? |
+|---|---|---|---|
+| **Phase 2 broad-read** — render 8-10 sample papers from publishers | YES | One `Bash` subprocess per paper OR one `Agent` per paper-cluster | Foreground for ≤4; background for more |
+| **Phase 5d gold-extraction** — DELEGATED to `article-finder generate-gold`; docpluck-iterate does NOT generate golds | N/A | `article-finder` owns extraction + its own parallelization | N/A |
+| **Phase 5d verification** — compare rendered.md ↔ gold for each affected paper | YES | One `Agent` per paper (independent inputs) | **Background** (1-2 min each) |
+| **Phase 5d cross-paper sweep** — corpus-level pattern detection on 5 papers | YES, but only ONE agent for the whole sweep | Single agent reading 5 paper pairs and emitting a corpus-level findings list | Foreground (one call, ~3-4 min) |
+| **Diagnostic artifact capture** — `pdftotext` + `extract_pdf_structured` per paper | YES | One `Bash` per paper | Foreground (each <5s) |
+| **Phase 5b broad pytest** — independent of Phase 5d verification | YES | `Bash` with `run_in_background: true` | Background; check via `Monitor` |
+| **Phase 5c 26-paper baseline** — independent of Phase 5d | YES | `Bash` with `run_in_background: true` | Background; ~10 min |
+| **Phase 6c rendered ↔ tables-tab parity** — across affected papers | YES | One `Bash` per paper | Foreground; each <5s |
+| **Phase 8 Tier-3 prod parity** — across affected papers (POST-deploy) | YES | One `Bash` curl per paper | Foreground; each <10s |
+| **Phase 7 release** — version bump + commit + tag + push + auto-bump merge | **NO** | Sequential git operations | N/A |
+| **Phase 4 library fix** — code edits | **NO** | Orchestrator holds architectural context | N/A |
+| **/docpluck-cleanup, /docpluck-review, /docpluck-deploy** — meta-skill chain | **NO** to running 2 at once | Each is sequential per its own internal logic | Foreground; chain them |
+## Concrete fan-out patterns to use
+**Pattern A — obtain golds, then fan-out VERIFICATION for affected papers (typical Phase 5d):**
+```
+1. Identify N affected papers for this cycle.
+2. For each paper: resolve the canonical key and `ai-gold.py check` the shared cache.
+   On a miss, gold generation is DELEGATED to `article-finder generate-gold` — docpluck
+   NEVER dispatches its own gold-extraction subagent (see Phase 5d Step 1; 2026-05-16
+   directive). Copy each `reading` view to `tmp/<paper>_gold.md`.
+3. While golds are obtained, render the affected papers at the working-tree version
+   via a single `Bash` script that renders them in sequence (Camelot is not thread-safe;
+   keep render serial).
+4. As each gold is ready, optionally dispatch its verifier Agent immediately (background).
+   Or wait for all golds, then dispatch all verifiers in a single multi-tool-call message.
+5. Aggregate verdicts as they return; queue defects per rule 0e.
+```
+**Pattern B — background long-running tasks during planning:**
+```
+1. Kick off the 26-paper baseline (`Bash` with run_in_background=true).
+2. Kick off the broad pytest (`Bash` with run_in_background=true).
+3. While both run, do Phase 3 TRIAGE re-read + Phase 4 code edit planning.
+4. By the time you need 5b/5c results, they're already done.
+```
+**Pattern C — corpus sweep with a single agent:**
+```
+For the every-3rd-cycle corpus sweep, do NOT fan out 5 separate agents.
+Use ONE agent given paths to 5 paper pairs and ask for a corpus-level
+findings list. This produces a coherent ranking; 5 independent agents
+would each produce a local list and the orchestrator would have to
+merge them by hand.
+```
+## Anti-patterns to avoid
+| Anti-pattern | Why it's wrong |
+|---|---|
+| "Dispatch 10 agents to each fix a different defect" | Multiple agents editing the same library code → merge conflicts, lost work. Orchestrator does fixes. |
+| "Dispatch parallel agents to bump version" | Two agents racing on `pyproject.toml` / `__init__.py` / git → broken commits. |
+| "Dispatch parallel agents to render the same paper" | Same `tmp/<paper>_v<version>.md` written twice → race condition. |
+| "Skip the Pattern-A wait and do verify before gold exists" | Verifier needs gold as input; sequential dependency. |
+| "Dispatch a subagent to read the PDF and produce a gold" | docpluck-iterate does NOT generate ground truth — generation is delegated to `article-finder generate-gold` (one producer, one protocol; 2026-05-16 directive). A local extraction subagent re-forks the gold. |
+| "Use multiple agents for a single small task" | Subagent dispatch has fixed overhead (~30s). For tasks <60s, do it inline. |
+| "Subagents share my conversation context" | They DON'T. Each subagent prompt must be self-contained — give absolute paths, restate the goal, restate the discipline. |
+## When in doubt
+Default to **sequential** when:
+- You're not sure if outputs collide.
+- The task takes <1 minute (overhead > benefit).
+- The task modifies global state (git, env, settings).
+Default to **parallel** when:
+- 3+ independent items with the same pattern (papers, sections, checks).
+- Each item takes ≥2 minutes.
+- Each item has a distinct output path.
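
The "When in doubt" defaults above can be read as a small decision function. The following is only a sketch and is not part of the skill file: `choose_mode` and its parameters are hypothetical names. It assumes the three "parallel" bullets must all hold, and that anything between the 1-minute and 2-minute thresholds falls back to sequential.

```python
def choose_mode(n_items: int, seconds_per_item: float,
                distinct_outputs: bool, modifies_global_state: bool) -> str:
    """Return "parallel" or "sequential" following the defaults listed above."""
    # Sequential triggers: possible output collisions, sub-minute tasks, global state.
    if not distinct_outputs or modifies_global_state or seconds_per_item < 60:
        return "sequential"
    # Assumption: parallel needs 3+ independent items AND at least 2 minutes each.
    if n_items >= 3 and seconds_per_item >= 120:
        return "parallel"
    # Assumption: the unspecified middle ground takes the safer default.
    return "sequential"
```

For example, `choose_mode(5, 150, True, False)` selects the parallel path, while the same call with `modifies_global_state=True` (a version bump, say) stays sequential.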

{docpluck-2.4.85 → docpluck-2.4.90}/CHANGELOG.md RENAMED

@@ -1,5 +1,65 @@
 # Changelog
+## [2.4.90] — 2026-06-15
+**RC-1 Step 2 — per-band region-aware two-column re-extraction (ship-dark behind `DOCPLUCK_COLUMN_CORRECT_BANDED`, default OFF).** No `NORMALIZATION_VERSION` change — the default path is byte-identical; the flag only adds reading-order corrections upstream of normalize.
+The dominant defect on two-column APA papers: pdftotext serializes a page's columns interleaved, so Method/Results/Discussion order scrambles and paragraphs split with their continuations displaced. The Step-1 whole-page corrector (`extract_page_text_columns`, v2.4.82) **cannot reach** these pages — its bilateral y-row gate and full-height gutter strip both reject any page that carries a full-width band (a table row, banner, or wide title) crossing the column centre, which is exactly the failing pages (TRIAGE 2026-06-15: re-rendering with `DOCPLUCK_COLUMN_CORRECT_GENERAL=1` produced byte-near-identical output).
+**Step 2** (`extract_page_text_banded`, `docpluck/extract_columns.py`) segments a flagged page into horizontal y-BANDS: a band whose central gutter strip is glyph-free across its rows is two-column prose → re-extracted left-then-right; a band with full-width content (table/banner/title) is kept as-is. Bands are reassembled top-to-bottom at glyph-free cut lines (vertically-overlapping bands are merged to full-width so a tall title glyph is never bisected). Applied **only as a fallback** inside `splice_column_corrected_pages` when the whole-page path returns "", and under the **same unconditional word-preservation guard** — a band crop that drops/fabricates a word is rejected and the page kept as-is, so the flag can only ADD a pure reorder, never ship corruption.
+**Validation (2026-06-15).** Corpus word-preservation scan across 4 two-column papers (71 flagged pages): every accepted page is a pure reorder; the 6 residual band-cut clips are all guard-rejected (no corruption). Production smoke (flag ON) corrects ~44 pages across 5 papers, all word-safe; flag OFF is byte-identical. **AI-verify vs article-finder reading golds** (Sonnet subagents) on the two heaviest papers (`chan_feldman_2025_cogemo`, `chandrashekar_2023_mp`): both **ON_BETTER** — Procedure/Manipulations ordering and the Power-analysis paragraph split fixed (chan); `## Results` heading restored and column-token orphans suppressed (chandrashekar); **zero text-loss, zero hallucination, zero regression** vs flag OFF. New regression `tests/test_rc1_banded_column_real_pdf.py` (synthetic-layout geometry units + real-PDF word-preservation + ship-dark-default). Touched-path suite 302 passed; 26-paper baseline 26/26 (flag OFF).
+**Known residual (documented, not regressions):** ~6/71 flagged pages still clip a single word at a band cut and are guard-rejected (stay interleaved — no worse than today); hard title+sidebar pages (e.g. PSPB p1) degrade to full-width (no de-interleave). These are the remaining Step-2 refinement targets before the default can be flipped.
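
The word-preservation guard described in the 2.4.90 entry above can be illustrated with a multiset comparison. This is a minimal sketch under stated assumptions, not the library's code: `is_pure_reorder` and `accept_or_keep` are hypothetical names, and whitespace tokenization is an assumption about how words are counted.

```python
from collections import Counter


def is_pure_reorder(original: str, corrected: str) -> bool:
    """True when both texts contain exactly the same words with the same counts."""
    # Assumption: words are whitespace-delimited tokens.
    return Counter(original.split()) == Counter(corrected.split())


def accept_or_keep(original: str, corrected: str) -> str:
    """Keep the corrected page only if no word was dropped or fabricated."""
    if corrected and is_pure_reorder(original, corrected):
        return corrected
    # Empty result or any word-count mismatch: keep the page as-is.
    return original
```

Under this scheme a band crop that clips even one word fails the comparison, so the page stays interleaved rather than shipping altered text.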
+## [2.4.89] — 2026-06-15
+**W0h — recover dropped-minus statistical coefficients from the LAYOUT channel (the no-CI residual W0g cannot reach).** `NORMALIZATION_VERSION` 1.9.34 → 1.9.35.
+Surfaced by `/docpluck-iterate` Phase-5d (B7 canary `ar_apa_j_jesp_2009_12_011`). On tight-kerned PDFs that draw the U+2212 minus in a dedicated symbol font, pdftotext **drops the glyph entirely**, so a body-prose coefficient `β = −.022, t(87) = .17, ns` reaches the text channel as `b = .022` — a **silent sign flip that inverts the statistical conclusion**. The existing W0g recovery needs a confidence interval to prove the sign; these betas report only a t-value, so W0g cannot reach them.
+**Root cause + recovery.** The dropped minus survives in the **layout channel** (pdfplumber) as an unmapped `(cid:N)` glyph in a symbol font (here `AdvP4C4E74`), immediately touching the coefficient. `normalize.recover_dropped_minus_via_layout` (W0h) detects this in the `<stat> = <minus><coef>` operator slot — an unmapped glyph whose left neighbour is `=` and whose right neighbour is a coefficient — and re-inserts the minus into the pdftotext text. Keyed on the **structural signature** (not paper/font identity): the `=`-anchor excludes `5.2 ± 0.3` (glyph between two numbers) so a dropped `±`/`≈`/dagger can never be mistaken for a sign; it flips only coefficients the layout proves negative, only as many times as found. Threaded through a dedicated `dropped_minus_layout` param (`render → extract_sections → normalize_text`) so the section pipeline's text-channel-only contract is preserved (F0 stays off). On `ar_apa` it recovers `β = −.022 / −.88 / −.428` and leaves the genuinely-positive `.48` untouched. Blast radius: of the 5 onboarded canary papers only `ar_apa` flips (exactly its 3 layout-visible negatives); the other 4 are byte-identical no-ops.
+**Documented limitation (NOT fixable in text+layout).** `ar_apa` `β = −.245` is drawn as **painted pixels** — its minus is absent from pdftotext AND pdfplumber chars/lines/rects/curves AND pdfminer's raw layer (proven). Only OCR could recover it; W0h deliberately leaves it rather than guessing. See `TODO.md` R5 Path 1.
+New regression `tests/test_dropped_minus_layout_recovery_real_pdf.py` (4 synthetic-layout unit tests pinning the `=`-slot guard + the no-flip-after-a-number guard, and 1 real-PDF render test). Minus/normalize/render/sections suite: 148 passed.
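
The `=`-slot signature from the 2.4.89 entry above can be sketched on a flattened layout line. This is an illustration only: the entry describes W0h as working on layout glyphs, whereas this sketch assumes the layout line has already been flattened to a string in which an unmapped glyph appears as `(cid:N)`. The names `negative_coefficients` and `recover_minus` are hypothetical.

```python
import re

# "=" followed by an unmapped glyph, then a coefficient such as .022 or 0.88.
_SLOT = re.compile(r"=\s*\(cid:\d+\)\s*(\d*\.\d+)")


def negative_coefficients(layout_line: str) -> list[str]:
    """Coefficients that the layout line shows with a glyph right after '='."""
    return _SLOT.findall(layout_line)


def recover_minus(text: str, layout_line: str) -> str:
    """Re-insert U+2212 before each such coefficient, once per layout hit."""
    for coef in negative_coefficients(layout_line):
        pattern = re.compile(r"(=\s*)" + re.escape(coef) + r"(?!\d)")
        text = pattern.sub(lambda m, c=coef: m.group(1) + "\u2212" + c, text, count=1)
    return text
```

Because the pattern is anchored on `=`, a glyph sitting between two numbers (the `5.2 ± 0.3` case) never matches, and a coefficient with no glyph in the layout is left unchanged.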
+## [2.4.88] — 2026-06-13
+**Camelot temp-file cleanup is best-effort — stop silently dropping every table on Windows.** (No `NORMALIZATION_VERSION` change — table channel only; output on POSIX/prod is unchanged.)
+Surfaced by `/docpluck-qa` (corpus render verifier, tag H) while validating v2.4.86/87: `efendic_2022_affect` rendered with **0** of its **4** in-body HTML tables. Reproduced on HEAD too (pre-existing, not from the layout fixes).
+**Root cause (Windows-only).** `tables/camelot_extract.py::extract_tables_camelot` writes the PDF to a `NamedTemporaryFile(delete=False)`, runs Camelot, then unlinks the temp file in a `finally` block. Under **camelot-py 2.0.0** on Windows, Camelot still holds the temp-file handle open at that point, so `Path(tmp_path).unlink()` raises `PermissionError [WinError 32]`. That exception propagated out of `extract_tables_camelot` and was swallowed by `extract_structured`'s broad `except Exception` (→ `method="…camelot_failed"`, `tables=[]`) — so **every** paper lost **all** Camelot tables on Windows, even though extraction itself succeeded. POSIX permits unlinking an open file, so prod/Linux/Railway never hit this; it was invisible outside Windows dev. (The drift to the camelot-py 2.0.0 major came via the unbounded `camelot-py[cv]>=0.11.0` pin.)
+**Fix.** The `finally` cleanup now swallows `OSError` (best-effort temp removal; the OS temp dir reclaims the file). Camelot table extraction is restored on Windows: `efendic_2022_affect` 0 → 5 tables (3 with HTML), corpus verifier PASS. New regression `tests/test_camelot_temp_cleanup.py` monkeypatches `Path.unlink` to raise and asserts `extract_tables_camelot` returns a list instead of propagating. Table/structured suite: 90 passed, 1 skipped.
+## [2.4.87] — 2026-06-13
+**F0 sources the body from the text channel — strip header/footer/footnote lines from pdftotext, never rebuild the body from spans.** `NORMALIZATION_VERSION` 1.9.33 → 1.9.34.
+Follow-up #1 from `docs/HANDOFF_2026-06-13_sciencearena_grobid_liteparse.md` — the deeper, L-001-preferred fix for the residual the v2.4.86 word-gluing patch left behind.
+**Root cause (architectural).** When `layout=` is supplied, the F0 step (`_f0_strip_running_and_footnotes`) was **rebuilding the entire body from `TextSpan.text`** and discarding the pdftotext `raw_text` the caller passed in (used only to locate footnote offsets). That rebuild-from-spans is what made *both* the v2.4.86 word-gluing (pdfplumber's char stream drops the inter-word space glyph) and the residual two-column interleaving (`_chars_to_spans` groups chars by y-coordinate only, so left+right columns at the same y merge into one span — e.g. `how www.cambridge.org/cns we can pay for it`) possible. It violates the documented text-channel/layout-channel split (CLAUDE.md, LESSONS L-001/L-007): the body must come from `extract_pdf` (pdftotext, which already infers spaces from the x-gap *and* serialises columns in reading order), and the layout channel should be used only to *identify* which lines are running headers/footers/footnotes.
+**Fix.** `_f0_strip_running_and_footnotes` now keeps the exact same span-based classification (repeating-header detection, body-y-band header/footnote position rules, font-size footnote test, table-region guard) but uses it only to build **strip-key sets**. It then walks the pdftotext `raw_text` line by line (splitting on both `\n` and the bare `\f` page separator, keeping separators so footnote byte-offsets stay exact), drops header/footer lines, moves footnote lines to the `\n\f\f\n` appendix, and keeps the rest **in pdftotext order with pdftotext spacing**. Keyed on the structural signature (content-key membership), so it generalises to any PDF; strip-keys that are pure-numeric or shorter than 4 chars are excluded (page numbers etc. are handled downstream by P0/P0r/R2 and are absent from the JATS body), removing the only realistic false-strip vector. The body is now provably a *line-subsequence* of the text channel — F0 only deletes, never reorders or merges.
+**Validation (ScienceArena pdf-text-fidelity-v1 held-out PMC corpus, 30 papers, token-F1 vs JATS gold).** Mean token-F1 **0.745 → 0.776** (above raw pdftotext 0.750; on par with normalized-pdftotext 0.767, and above the per-paper cases where F0's header/footnote stripping adds lift). Primary metric (`0.5·levenshtein + 0.5·token_f1`, the leaderboard rank key) **0.559 → 0.666** — the large jump comes from levenshtein (~0.30 → ~0.60) now that column reading-order is correct. Zero catastrophic-zero papers (unchanged). `_join_chars_with_spaces` (v2.4.86) is retained — it is now load-bearing for span-text key matching and for the sections/tables layout consumers.
+New regression tests in `tests/test_normalize_f0_footnote_strip.py`: the F0 body is a line-subsequence of the text channel (a span-rebuild regression reorders/merges and fails), and every kept body line is an exact substring of `raw_text` (spacing comes from the text channel, never a glued span). LESSONS L-007 updated: the architectural follow-up is now done.
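
The line-subsequence property stated in the 2.4.87 entry above can be checked with a single pass over the text channel. This is a minimal sketch, assuming the footnote appendix has already been excluded from `body`; `is_line_subsequence` is a hypothetical helper, not the project's test code.

```python
def is_line_subsequence(body: str, raw_text: str) -> bool:
    """True when body's non-empty lines appear in raw_text in the same order."""
    # Treat the bare form feed as a line separator, like a newline.
    raw_lines = iter(raw_text.replace("\f", "\n").split("\n"))
    body_lines = [ln for ln in body.replace("\f", "\n").split("\n") if ln.strip()]
    # `ln in raw_lines` consumes the iterator up to the match, which enforces order.
    return all(ln in raw_lines for ln in body_lines)
```

A body that only deletes lines passes. A body that reorders lines, or merges two lines into one, fails because the merged line is not an exact line of `raw_text`.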
+## [2.4.86] — 2026-06-13
+**F0 layout body channel — reinsert inter-word spaces from the x-gap (stop gluing words on tight-kerned PDFs).** `NORMALIZATION_VERSION` 1.9.32 → 1.9.33.
+Surfaced by `docs/HANDOFF_2026-06-13_sciencearena_grobid_liteparse.md` (ScienceArena pdf-text-fidelity-v1 held-out PMC corpus): on ~16 of 30 real biomedical PDFs docpluck's `normalize_text(..., layout=...)` body scored token-F1 **≈ 0.00** against the JATS gold while raw pdftotext scored 0.7–0.9 — *with a normal character count*. Same characters, zero token overlap: the words were glued (`CNSSpectrums`, `Thebehavioralhealthcarecontinuuminthe`, `UnitedStates`).
+**Root cause.** When `layout=` is supplied, the F0 step (`_f0_strip_running_and_footnotes`) rebuilds the body from `TextSpan.text`. Span text was built in `extract_layout._chars_to_spans` by `"".join(c["text"] for c in line)` — a naive character concatenation with **no x-gap handling** (the function's own docstring claimed x-gap splitting that was never implemented). pdfplumber's `chars` stream omits the inter-word space glyph on tight-kerned PDFs (Cambridge journals, many two-column layouts) — pdftotext infers those spaces from the horizontal gap, but the raw char stream does not carry them. So on those PDFs the entire layout body collapsed to space-ratio ~0.005 (vs ~0.13 for the same text via pdftotext), and any consumer using the layout body channel (the recommended body-fidelity path since v2.4.83) silently emitted unspaced text. This is the long-standing `feedback_pdfplumber_extract_words_unreliable` failure mode — "always carry a char-level absolute-x-gap fallback" — which had never been applied to span text.
+**Fix.** New `extract_layout._join_chars_with_spaces` reinserts a space between consecutive glyphs when the horizontal gap exceeds a font-relative threshold (`gap > 0.20·font_size`), keyed on the structural signature (x-gap) rather than any paper identity, so it generalizes to every tight-kerned PDF. `0.20·size` reproduces pdftotext/JATS spacing to within ~0.2% space-density on the Cambridge/PMC corpus (PMC13064744 token-F1 0.00 → 0.86; space-ratio 0.0053 → 0.1282; full 30-paper held-out PMC token-F1 mean ~0.34 → 0.747 with zero catastrophic-zero papers, was 16/30). The guard never doubles an existing space and never splits within a tightly-kerned word. One known residual remains (documented, not addressed here): `_chars_to_spans` still does not split a line at a column gutter, so a line spanning two columns is merged — a smaller reading-order issue handled by the text channel, tracked for a follow-up.
+New regression tests in `tests/test_extract_layout.py`: x-gap space insertion on a word gap, no-split within a tight word, no double-space across an explicit space glyph, and a real-render assertion that normal loosely-kerned text keeps its word spaces.
 ## [2.4.85] — 2026-06-12
 **Harvard name-year reference splitting (D1) + page-break reference stitch & category-label running-header strip (D2).** `NORMALIZATION_VERSION` 1.9.31 → 1.9.32.

{docpluck-2.4.85 → docpluck-2.4.90}/LESSONS.md RENAMED

@@ -156,6 +156,102 @@ Demo showing the difference: `docs/superpowers/plans/spot-checks/splice-spike/ht
 ---
+## L-007 — Layout span text MUST reinsert inter-word spaces from the x-gap (never `"".join(chars)`)
+### The recurring mistake
+When a downstream step rebuilds text from the **layout channel** (`extract_pdf_layout`
+→ `TextSpan.text`), it is tempting to construct a line's text by concatenating
+pdfplumber's per-character `chars`: `"".join(c["text"] for c in line)`. This is wrong.
+pdfplumber's char stream **does not carry the inter-word space glyph** on tight-kerned
+PDFs (Cambridge journals, many two-column layouts) — pdftotext *infers* those spaces
+from the horizontal gap, but the raw chars do not. So the naive join glues whole lines
+into one token (`CNSSpectrums`, `Thebehavioralhealthcarecontinuuminthe`).
+### What it broke (2026-06-13, v2.4.86)
+`extract_layout._chars_to_spans` built span text with the naive join. Since v2.4.83 the
+F0 step (`normalize_text(..., layout=...)`) rebuilds the **body** from spans, so on
+~16 of 30 real biomedical PDFs the body collapsed to space-ratio ~0.005 (vs ~0.13 via
+pdftotext) — token-F1 ≈ 0.00 against the JATS gold *with a normal character count*.
+The defect was invisible to char-ratio/word-delta metrics (the chars are all there;
+only the spaces are gone) and was surfaced by ScienceArena's `pdf-text-fidelity-v1`
+held-out PMC set, where raw pdftotext beat docpluck. The function's own docstring even
+*claimed* x-gap handling that had never been implemented.
+### The rule
+- Any reconstruction of text from layout chars MUST reinsert a space when the
+  horizontal gap between consecutive glyphs exceeds a **font-relative** threshold
+  (`gap > 0.20·font_size` reproduces pdftotext/JATS spacing to ~0.2% space-density).
+  Use `extract_layout._join_chars_with_spaces`; never `"".join(chars)`.
+- This is the in-repo instance of memory `feedback_pdfplumber_extract_words_unreliable`
+  ("always carry a char-level absolute-x-gap fallback"). It applies to span text, and
+  to any future layout-channel text reconstruction (sections annotators, tables).
+- **A space-density collapse is the canary.** When a layout-derived body has space-ratio
+  far below the pdftotext text for the same PDF (e.g. < 0.05 vs ~0.13), suspect glued
+  word boundaries before anything else — it is not "dropped text."
+- Architecturally, the body is sourced from `extract_pdf` (pdftotext, which already has
+  correct spaces AND correct column reading-order) and the layout channel is used only
+  to *identify* lines to strip (running headers / footnotes), per L-001's
+  text-channel/layout-channel split. **Done in v2.4.87** (`NORMALIZATION_VERSION`
+  1.9.34): `_f0_strip_running_and_footnotes` no longer rebuilds the body from spans — it
+  builds strip-key sets from the span classification and deletes the matching lines from
+  the pdftotext `raw_text`, keeping the rest in pdftotext order/spacing. This closed the
+  residual two-column interleaving (`how www.cambridge.org/cns we can pay for it`) that
+  the v2.4.86 spacing patch left behind, and lifted the held-out PMC token-F1 mean
+  0.745 → 0.776 (primary 0.559 → 0.666). The F0 body is now provably a line-subsequence
+  of the text channel (guarded by
+  `tests/test_normalize_f0_footnote_strip.py::test_f0_body_is_a_line_subsequence_of_the_text_channel`).
+  Rebuilding the whole body from spans is the smell that made the gluing bug possible; do
+  not reintroduce it.
+Cite: `docpluck/normalize.py` (`_f0_strip_running_and_footnotes`),
+`docpluck/extract_layout.py` (`_join_chars_with_spaces`),
+`tests/test_normalize_f0_footnote_strip.py`, `tests/test_extract_layout.py`,
+CHANGELOG 2026-06-13 (v2.4.86 spacing, v2.4.87 body-source).
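
The x-gap rule in L-007 above can be sketched as follows. This is an illustrative version, not the body of `extract_layout._join_chars_with_spaces`: the function name here is hypothetical, it assumes the chars belong to one line and carry pdfplumber's `text`, `x0`, `x1`, and `size` keys, and using the right-hand glyph's font size for the threshold is an assumption.

```python
def join_chars_with_spaces_sketch(chars: list[dict], ratio: float = 0.20) -> str:
    """Join glyphs left to right, adding a space where the x-gap is wide."""
    out: list[str] = []
    prev = None
    for ch in sorted(chars, key=lambda c: c["x0"]):
        if prev is not None:
            gap = ch["x0"] - prev["x1"]
            wide = gap > ratio * ch["size"]
            # Never double a space that already exists as a glyph.
            if wide and not prev["text"].isspace() and not ch["text"].isspace():
                out.append(" ")
        out.append(ch["text"])
        prev = ch
    return "".join(out)
```

With a 10 pt font the threshold is 2 pt, so a 0.5 pt kerning gap inside a word joins the glyphs while a 4 pt gap between words yields a space.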
+---
+## L-008 — Temp-file cleanup must be best-effort; a broad `except` around extraction will swallow a cleanup error into total silent failure
+### The recurring mistake
+A function writes input to a `NamedTemporaryFile(delete=False)`, runs an external
+library, and unlinks the temp file in a `finally` block. The caller wraps the whole
+call in `except Exception: return []`. If the unlink raises, the exception escapes the
+`finally`, the caller's broad `except` swallows it, and the **successful** extraction
+result is discarded — a total, silent, output-zeroing failure that looks like "the
+tool found nothing."
+### What it broke (2026-06-13, v2.4.88)
+`tables/camelot_extract.py::extract_tables_camelot` unlinked its temp PDF in a
+`finally`. Under **camelot-py 2.0.0 on Windows**, Camelot still held the file handle
+open, so `Path(tmp_path).unlink()` raised `PermissionError [WinError 32]`. The
+exception propagated into `extract_structured`'s `except Exception` →
+`camelot_failed`, `tables=[]` — so **every** paper lost **all** tables on Windows even
+though Camelot had extracted them fine. POSIX allows unlinking an open file, so
+prod/Linux/Railway never saw it; it was invisible outside Windows dev and only caught
+by the corpus render verifier (tag H, 4 tables → 0).
+### The rules
+1. **Temp-file cleanup is always best-effort.** Wrap `unlink`/`rmtree` of a temp path
+   in `try/except OSError: pass` (or use a tempdir context that tolerates it). A
+   failure to delete scratch is never worth failing — or silently zeroing — the real
+   result. The OS temp dir reclaims it.
+2. **A platform-specific cleanup failure is invisible when you only test on the platform prod runs on.**
+   POSIX `unlink`-while-open succeeds; Windows refuses. If extraction works in CI/Linux
+   but returns empty locally on Windows (or vice-versa), suspect a `finally`-block
+   cleanup raising under a held file handle before suspecting the extractor.
+3. **A broad `except Exception` around a subprocess/library call hides this class.**
+   When such a wrapper exists, the inner function must not raise on cleanup — otherwise
+   "tool failed, 0 results" silently conflates real failure with a cosmetic cleanup error.
+4. **Pin breaking-major dependencies.** The drift to camelot-py 2.0.0 came through the
+   unbounded `camelot-py[cv]>=0.11.0` pin. Settled-on deps should carry a tested upper
+   bound (see memory `feedback_no_silent_optional_deps`); a major bump is opt-in + re-verified.
+Cite: `docpluck/tables/camelot_extract.py` (`extract_tables_camelot` `finally`),
+`docpluck/extract_structured.py` (the broad `except`), `tests/test_camelot_temp_cleanup.py`,
+CHANGELOG 2026-06-13 (v2.4.88).
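
Rule 1 above can be sketched with the standard library alone. This is a minimal illustration of best-effort cleanup, not the code in `extract_tables_camelot`: `run_on_temp_copy` and its `extract` callback are hypothetical names.

```python
import tempfile
from pathlib import Path


def run_on_temp_copy(pdf_bytes: bytes, extract) -> list:
    """Write bytes to a temp file, run `extract(path)`, and clean up best-effort."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        tmp_path = tmp.name
    try:
        return extract(tmp_path)
    finally:
        try:
            Path(tmp_path).unlink()
        except OSError:
            # For example PermissionError [WinError 32] while a library still
            # holds the handle. Leave the file for the OS temp dir to reclaim.
            pass
```

`PermissionError` is a subclass of `OSError`, so the extraction result survives a failed unlink instead of being replaced by the cleanup exception.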
+---
 ## When to add a new lesson here
 Add a lesson when:

{docpluck-2.4.85 → docpluck-2.4.90}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpluck
-Version: 2.4.85
+Version: 2.4.90
 Summary: PDF, DOCX, and HTML text extraction and normalization for academic papers
 Project-URL: Homepage, https://docpluck.app
 Project-URL: Documentation, https://docpluck.app/api-docs
@@ -22,7 +22,7 @@ Classifier: Programming Language :: Python :: 3.13
 Classifier: Topic :: Scientific/Engineering :: Information Analysis
 Classifier: Topic :: Text Processing :: General
 Requires-Python: >=3.10
-Requires-Dist: camelot-py[cv]>=0.11.0
+Requires-Dist: camelot-py[cv]<3.0,>=0.11.0
 Requires-Dist: pdfplumber>=0.11.0
 Provides-Extra: all
 Requires-Dist: beautifulsoup4>=4.12.0; extra == 'all'
