PyPI - docpluck - Versions diffs - 2.4.99__tar.gz → 2.4.102__tar.gz - Mend

docpluck 2.4.99tar.gz → 2.4.102tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (449)

{docpluck-2.4.99 → docpluck-2.4.102}/.claude/skills/_project/lessons.md RENAMED

@@ -653,3 +653,40 @@ Rotation picks `pool[(N mod L) : (N mod L) + rotation_size]` wrapping. Over `cei
 ## 2026-06-25 — Word-multiset preservation is BLIND to reading-order regressions; AI-verify is mandatory before any reorder flip
 The RC-1 banded column-correction flag passed a full-corpus word-preservation scan (26/26 baseline papers, 0 multiset violations) — looked safe to flip default-ON. The 8-canary AI-verify then found **3 ON_REGRESSION** the scan was structurally blind to: running-header furniture injected into prose, Abstract/Intro section order inverted, prose fragmented on a *single-column* paper (false-positive gutter). A pure reorder/furniture-injection preserves the word multiset exactly. **Never flip a reading-order-affecting default on word-preservation (or char-ratio / Jaccard) evidence alone — those gates cannot see "right words, wrong order / wrong place." AI-verify against the gold is required, and the bar is zero ON_REGRESSION across the full canary set.** Re-validates the project ground-truth hard rule.
+## 2026-07-02 · docpluckapp (frontend + service, NOT the library) · "Unhandled error in /api/extract" = heavy extraction outran the transport, escaping as an unhandled throw
+**What (fix).** Daily digest showed 2 distinct fingerprints / 6 total, all `POST /api/extract?normalize=academic&quality=true&structured=true&sections=true` in one ~47-min window. Ground truth from prod `system_logs.context.errorStack` (queried read-only via the Neon `DATABASE_URL` in `PDFextractor/frontend/.env.local`): `TimeoutError: The operation was aborted due to timeout` ×3 and `SocketError: other side closed` (undici `onHttpSocketEnd`) ×3 — two faces of ONE slow request. Root cause was two-part: (1) **the FastAPI service blocked its own event loop** — `/extract` `/sections` `/render` `/tables` `/analyze` are `async def` but called the synchronous CPU-bound docpluck entrypoints (`extract_pdf`, `extract_pdf_structured` [Camelot], `extract_sections`, `render_pdf_to_markdown`) directly on the loop, so a big structured+sections paper held the loop tens of seconds, starving Railway's healthcheck/keep-alive → Railway edge dropped the socket; (2) **the Next route didn't guard the response-body read** — `fetch()` resolves on headers, so `await serviceResponse.json()` (the body read) was OUTSIDE the try/catch that wrapped `fetch()`; a mid-stream abort/close threw there and escaped to Next's `onRequestError`. Fixes (all general, not paper-specific): service `_offload()` (thin `starlette.concurrency.run_in_threadpool` wrapper) on all 20 heavy call sites; route `classifyTransportError()` (TimeoutError/AbortError→504, else 502) wrapping BOTH the fetch-init catch and the body read, plus a malformed-2xx-body→502 guard and `export const maxDuration = 150`; Dockerfile `uvicorn --timeout-keep-alive 75`; and — surfaced by /docpluck-review rule 19 — the handler was NOT wrapped in `withRouteErrorLogging` (pre-existing), now `export const POST = withRouteErrorLogging("api/extract", handleExtract)`. 
+Regression test `frontend/src/app/api/extract/transport-error.test.ts` (pure classifier extracted to `transport-error.ts` so it imports no `next/server` and runs under `node --test`) asserts the two exact incident signatures map to 504/502.
+**How to detect / avoid next time.** (1) **Never call a synchronous CPU-bound library from inside an `async def` FastAPI handler** — offload to the threadpool, or one heavy request starves healthchecks and serializes all traffic. (2) **`fetch()` resolves on headers; the body read is a SEPARATE failure surface** — guard `await res.json()`/`.text()`, not just the `fetch()` call. (3) **Pin `maxDuration` above any in-handler fetch timeout** so your own clean error fires before the platform's opaque 504. (4) A repo-wide pre-existing gap can hide a whole error class: the service test suite could not even run locally (async httpx fixture needed `anyio`+`httpx`, neither declared/installed; local FastAPI 0.137.2 was missing `annotated-doc` — `pip check` flags it). Added `service/requirements-dev.txt` and installed the missing dep; full suite now runs (156 passed / 2 skipped). Companion to the 2026-06-20 daily-digest entry (same `system_logs` ground-truth technique). App-only change — no library version bump, deploys via docpluckapp `master` push → Vercel + Railway.
+---
+## 2026-07-03 · A DISABLE_CAMELOT glyph-recovery test does NOT prove the production Camelot HTML-table path
+**What.** `recover_minus_via_ci_pairing` (W0d) recovers a `2`-for-U+2212 minus on a
+bracket-less point estimate by pairing it with the CI in the SAME record. It fired
+for efendic in the `DOCPLUCK_DISABLE_CAMELOT=1` unstructured-table channel (CI on the
+same text line) but **silently missed every negative B-coefficient in the Camelot
+HTML-table channel** — the production default — because Camelot emits each `<td>` on
+its own line, so the SE cell between the B cell and the CI cell pushes the char-gap
+(37) past the 30-char bare-bracket proximity cap. The existing regression test passed
+only because it set `DISABLE_CAMELOT=1`. Result: 27 corrupt cells (`20.09` for `−0.09`)
+shipped in production for the whole v2.4.x series while the test stayed green.
+**How to detect.** For ANY glyph/text recovery that can reach a table cell, render the
+affected paper WITH Camelot ON (the plain `render_pdf_to_markdown`, no env) and grep the
+corruption signature in `<td>` cells — do NOT trust a `DISABLE_CAMELOT` test. Each glyph
+fix must be verified in all three channels (body normalize, Camelot `cell_cleaning`, and
+the final `render_pdf_to_markdown` post-process) — memory `glyph-fixes-need-all-three-text-channels`.
+**How to fix (general).** In an HTML table row, columns pair by geometry, not prose
+distance. Relax a bare bracket's proximity to the labeled-bracket rule when the record
+is a `<tr>` (`"<td" in record`), keeping `_INDEPENDENT_STAT_BETWEEN_RE` as the guard so
+it still can't pair back across a different estimate's column. Prose lines keep the
+strict gap. Keyed on `<tr>` structure, not paper identity → general.
+**Bonus (leave-nothing-behind).** Widening a recovery is the moment to run an adversarial
+"what could this now WRONGLY recover" battery — it caught a pre-existing bare-bracket FP
+(prose `SD = 2.01 … d = 0.09 [-1.86,0.04]` flipping `2.01`) that the happy-path test's
+slightly-different string had never exercised.
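
Editorial note on the 2026-07-02 lesson above ("never call a synchronous CPU-bound library from inside an `async def` handler"). The sketch below is a minimal illustration, not the service's code: the diff says the real fix wraps `starlette.concurrency.run_in_threadpool` in `_offload()`, whereas this example uses the standard library's `asyncio.to_thread` so it runs without extra packages. `heavy_extract`, `handler_offloaded`, `count_ticks`, and `demo` are hypothetical names, and `time.sleep` stands in for the blocking extraction call.

```python
import asyncio
import threading
import time


def heavy_extract(seconds: float) -> str:
    """Hypothetical stand-in for a synchronous, blocking extraction call."""
    time.sleep(seconds)  # blocks whichever thread runs it
    return threading.current_thread().name


async def handler_offloaded(seconds: float) -> str:
    # Run the blocking call on a worker thread so the event loop stays free
    # to serve other coroutines (healthchecks, keep-alives, other requests).
    return await asyncio.to_thread(heavy_extract, seconds)


async def count_ticks(stop: asyncio.Event) -> int:
    """Counts how often the loop gets scheduled while the handler is busy."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(0.01)
        ticks += 1
    return ticks


async def demo() -> tuple[str, int]:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    worker_name = await handler_offloaded(0.2)
    stop.set()
    return worker_name, await ticker
```

Calling `heavy_extract` directly inside the coroutine instead would keep the ticker from running until the call returned, which is the starvation pattern the lesson describes.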

{docpluck-2.4.99 → docpluck-2.4.102}/.claude/skills/docpluck-iterate/LEARNINGS.md RENAMED

@@ -1368,3 +1368,29 @@ aren't skipped.
 ### Verification
 - `tests/test_whitespace_char_fallback.py` (5 cases incl. real-PDF ip_feldman T10 = all 7 gold rows recovered; synthetic tight-kerned recovery; word-space reinsertion; delegation wiring; single-column-no-fabrication guard). 15/15 pass on the whitespace/char/caption-table suite; 0 failures across the broad table/render run; word path proven byte-identical (restored verbatim).
+---
+## Run: 2026-07-03 · cycle 1 · @ v2.4.101 → v2.4.102 · verdict FAIL (A1 shipped; corpus still FAIL)
+**Goal:** user `/docpluck-iterate 5 hours`. TRIAGE (2026-06-25) was stale (predated v2.4.99/100/101) → cycle 1 = mandatory broad-read + full canary baseline reproduction at HEAD.
+### Headline finding — the canary corpus is 5/5 FAIL at v2.4.101 (reproduce-before-trust paid off)
+Rendered the 3 fixed + 2 rotating canaries at HEAD and dispatched 5 parallel Sonnet AI-gold verifiers (Claude Max). **All 5 FAIL.** The stale TRIAGE's DP-* items had mostly shipped, but the underlying clusters are wide open. New fresh TRIAGE written (`docs/TRIAGE_2026-07-03_head_v2.4.101_assessment.md`) with 4 root-cause clusters: A (GLYPH sign corruption), B (table data-loss/RC-T, architectural), C (heading split/demotion), D (figure-axis-label leak). Plus broad-read discovery of E1-E4 (title-glyph `T` injection, keyword→heading promotion, RSOS citation-masthead-before-title, Collabra body-sentence→heading).
+### What shipped (clean, verified) — A1: 2-for-minus reaches the Camelot HTML-table channel (v2.4.102)
+- **The bug:** efendic rendered **27 corrupt `2X.XX` B-column cells** (`20.09` for `−0.09`) across regression Tables 2-5 — a silently-WRONG number, the worst meta-science defect class. Root cause: `recover_minus_via_ci_pairing` (W0d) pairs a bracket-less estimate with the CI in its **same record** and recovers the minus by containment; but Camelot emits each `<td>` on its own line, so the **SE cell sits between the B cell and the CI cell**, pushing the char-gap to 37 (past the 30-char cap W0d applies to *bare* brackets). The recovery fired in the DISABLE_CAMELOT unstructured-table channel but **silently missed the Camelot HTML-table channel — the production default.**
+- **The blind spot (the key lesson):** the existing `test_efendic_table_point_estimates_recovered_via_ci` PASSED at HEAD — but it runs with `DOCPLUCK_DISABLE_CAMELOT=1`. So a green test masked a production-path defect. This is `glyph-fixes-need-all-three-text-channels` again: a recovery landed in one channel; the Camelot table channel was uncovered. **A DISABLE_CAMELOT test does NOT prove the production Camelot HTML-table path is correct.** Added a Camelot-ON real-PDF test that renders through the production default.
+- **The fix (general, structural):** inside an HTML `<tr>` columns pair by table geometry, not prose adjacency — a bare bracket now uses the relaxed (labeled) proximity, still guarded by `_INDEPENDENT_STAT_BETWEEN_RE`. Prose lines (no `<td>`) keep the strict 30-char cap. Keyed on `<td>`/`<th>` presence, not paper identity → generalizes to any regression-table with the same signature. 27→0 corrupt cells; genuine `2.56` (Direction row, `∈[2.42,2.69]`) correctly preserved by the containment invariant; idempotent.
+- **Leave-nothing-behind bonus:** while adversarially battery-testing, found a PRE-EXISTING bare-bracket FP — the tight-spaced majumder variant `M=5.37, SD=2.01, t(1827)=1.83, d=0.09 [-1.86,0.04]` is only 25 chars gap so the cap alone let `2.01` flip to `-.01`, but the CI is `d`'s. Fixed by running the independent-stat guard for EVERY bracket kind (not just labeled). Confirmed pre-existing at HEAD (git show comparison), not introduced by my change.
+### Verification (ground truth = AI-gold via article-finder, never pdftotext)
+efendic Camelot-ON render 27→0; independent Sonnet AI-gold re-verify: A1 resolved, B-column exact vs gold, **0 new regressions**. Full minus suite 22 pass (6 new); adversarial over-recovery battery all pass; broad normalize/render/sections/heading **557 pass**; flatten 39; RC-T degenerate-table (Camelot) 8; deterministic DISABLE_CAMELOT body-text diff of 8 corpus papers = **0 changed** (surgical). iterate-gate `--cycle 1` = FAIL on I3 only (the honest canary FAILs) — all other I-rules (I1/I2/I5/I8/I10/I11/I12) pass.
+### Honest standing verdict: FAIL (rule 0e-bis)
+A1 is one root-cause class shipped. efendic still FAILs (A2 `×`-for-`3`, A3 SE-spurious-minus, affiliation-leak remain); the other 4 canaries FAIL on clusters B/C/D. The run continues — A2 is cycle 2. Did NOT report "clean." Standing corpus verdict FAIL until the canary set is clean or budget exhausted.
+### Process wins
+- **Reproduce-at-HEAD before trusting a stale TRIAGE** (card `reproduce-triage-defect-at-head`) — the 2026-06-25 TRIAGE would have sent me chasing already-shipped DP-* items; the fresh 5-canary baseline surfaced the REAL open clusters.
+- **5 parallel Sonnet verifiers** for the canary baseline (subagent-parallelization mandate) — one orchestrator context, 5 independent verdicts in ~80s.
+- **Adversarial battery beat the unit tests** — a hand-written 7-case over-recovery battery caught the pre-existing majumder-variant FP that the existing test's slightly-different string missed. For a recovery-widening change, an adversarial "what could this now wrongly recover" battery is worth more than the happy-path test.
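
Editorial note on the containment invariant described in the A1 entry above (a point estimate lies inside its own CI). The function below is a simplified sketch under stated assumptions, not the library's `recover_minus_via_ci_pairing`: it takes an already-paired estimate and CI, and models neither the proximity cap nor the `_INDEPENDENT_STAT_BETWEEN_RE` guard. The name `recover_minus_by_containment` and the efendic CI bounds used in the usage note are hypothetical.

```python
def recover_minus_by_containment(estimate: str, ci_low: float, ci_high: float) -> str:
    """Return the estimate with a leading '2' read as a minus sign, but only
    when that reading is the one that places the value inside its CI."""
    value = float(estimate)
    if ci_low <= value <= ci_high:
        return estimate  # already consistent with its CI: leave untouched
    if len(estimate) < 2 or not estimate.startswith("2"):
        return estimate  # no candidate '2'-for-minus glyph to reinterpret
    candidate = "-" + estimate[1:]
    try:
        candidate_value = float(candidate)
    except ValueError:
        return estimate  # the remainder is not a number
    if ci_low <= candidate_value <= ci_high:
        return candidate
    return estimate
```

With a hypothetical CI of `[-0.15, -0.03]`, `20.09` becomes `-0.09`, while the genuine `2.56` with `[2.42, 2.69]` is left alone. Containment alone is not sufficient: given `2.01` and the unrelated CI `[-1.86, 0.04]`, this sketch returns `-.01`, which is the false positive the entry reports and the reason the independent-stat guard is applied before pairing.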

{docpluck-2.4.99 → docpluck-2.4.102}/.claude/skills/docpluck-iterate/SKILL.md RENAMED

@@ -1,6 +1,6 @@
 ---
 name: docpluck-iterate
-description: Use when the user wants to run an autonomous library→local→deploy iteration loop on docpluck — fix-verify-release-deploy cycles working through a backlog of corpus defects until a stop condition is met (time budget, iteration count, corpus pass-rate threshold, or explicit "until X"). Self-improving: appends LEARNINGS each cycle and proposes SKILL.md amendments after recurring patterns. Triggers on phrases like "iterate on docpluck", "run the docpluck loop", "self-improve docpluck", "fix-and-deploy until X", "keep working on the corpus", or after a v2.x.y release when the user asks to continue iterating.
+description: 'Use when the user wants to run an autonomous library→local→deploy iteration loop on docpluck — fix-verify-release-deploy cycles working through a backlog of corpus defects until a stop condition is met (time budget, iteration count, corpus pass-rate threshold, or explicit "until X"). Self-improving: appends LEARNINGS each cycle and proposes SKILL.md amendments after recurring patterns. Triggers on phrases like "iterate on docpluck", "run the docpluck loop", "self-improve docpluck", "fix-and-deploy until X", "keep working on the corpus", or after a v2.x.y release when the user asks to continue iterating.'
 tags: [docpluck, python, fastapi, nextjs, vercel, railway, neon, iterate, orchestration, self-improving, qa, deploy]
 user-invocable: true
 argument-hint: "[--goal time:60m | iters:5 | baseline:26/26+full:95/101 | until:\"description\"] [--no-broad-read] [--dry-run]"

{docpluck-2.4.99 → docpluck-2.4.102}/.claude/skills/docpluck-qa/SKILL.md RENAMED

@@ -29,7 +29,7 @@ If QA surfaces an issue — any issue, however small, whether pre-existing, alre
 - **Frontend:** Next.js 16 + Auth.js + Drizzle (in `frontend/`), port 6116
 - **Service:** Python FastAPI importing `docpluck` library (in `service/`), port 6117
 - **Database:** Neon Postgres (docpluck project)
-- **ESCIcheck PDFs:** `C:\Users\filin\Dropbox\Vibe\ESCIcheck\testpdfs\Coded already\` (56 PDFs, APA psychology papers)
+- **ESCIcheck PDFs:** `C:\Users\filin\Dropbox\Vibe\MetaScienceProjects\COREteamToolsTemplates\COREcoding\testpdfs\Coded already\` (50 PDFs, APA psychology papers)
 - **Test PDFs:** `test-pdfs/` (47 PDFs, 8 citation styles)
 - **Test suite:** `service/tests/` (151 tests across 6 files)

{docpluck-2.4.99 → docpluck-2.4.102}/.claude/skills/docpluck-qa/references/check-13-escicheck-production.md RENAMED

@@ -14,7 +14,7 @@ if not API_KEY:
     print('SKIP: set DOCPLUCK_API_KEY env var to run production check')
     exit(0)
-ESCI_DIR = r'C:\Users\filin\Dropbox\Vibe\ESCIcheck\testpdfs\Coded already'
+ESCI_DIR = r'C:\Users\filin\Dropbox\Vibe\MetaScienceProjects\COREteamToolsTemplates\COREcoding\testpdfs\Coded already'
 pdfs = sorted(os.listdir(ESCI_DIR))[:10]
 results = []

{docpluck-2.4.99 → docpluck-2.4.102}/.claude/skills/docpluck-qa/references/check-5-escicheck-library.md RENAMED

@@ -10,7 +10,7 @@ python -c "
 import os, re, sys
 from docpluck import extract_pdf, normalize_text, NormalizationLevel, compute_quality_score
-ESCI_DIR = r'C:\Users\filin\Dropbox\Vibe\ESCIcheck\testpdfs\Coded already'
+ESCI_DIR = r'C:\Users\filin\Dropbox\Vibe\MetaScienceProjects\COREteamToolsTemplates\COREcoding\testpdfs\Coded already'
 pdfs = sorted(os.listdir(ESCI_DIR))[:10]  # First 10 alphabetically
 results = []

{docpluck-2.4.99 → docpluck-2.4.102}/.claude/skills/docpluck-qa/references/check-6-escicheck-local-webapp.md RENAMED

@@ -15,7 +15,7 @@ Then run:
 python -c "
 import os, re, json, requests
-ESCI_DIR = r'C:\Users\filin\Dropbox\Vibe\ESCIcheck\testpdfs\Coded already'
+ESCI_DIR = r'C:\Users\filin\Dropbox\Vibe\MetaScienceProjects\COREteamToolsTemplates\COREcoding\testpdfs\Coded already'
 pdfs = sorted(os.listdir(ESCI_DIR))[:10]
 # Test via the Python service directly (bypasses auth)

{docpluck-2.4.99 → docpluck-2.4.102}/CHANGELOG.md RENAMED

@@ -1,5 +1,45 @@
 # Changelog
+## [2.4.102] — 2026-07-03
+**'2'-for-minus glyph recovery now reaches the Camelot HTML-table channel — every negative B-coefficient in a regression table (`20.09` for `−0.09`) is recovered, not just the ones in the no-Camelot fallback.** `NORMALIZATION_VERSION` → `1.9.37`. A cycle-1 AI-gold canary re-verify at v2.4.101 found efendic_2022_affect rendering **27 corrupt `2X.XX` B-column cells** across regression Tables 2–5 (gold `−0.09` → `20.09`, `−1.09` → `21.09`, …) — a silently-wrong statistic, the worst defect class for a meta-science tool.
+**Root cause (a text-channel-coverage gap, not a new corruption).** The `2`-for-U+2212 recovery `recover_minus_via_ci_pairing` (W0d) pairs a bracket-less point estimate with the confidence interval in its **same record** and recovers the minus by the containment invariant (a point estimate lies inside its own CI). Camelot emits each table cell on its own line, so inside a multi-line `<tr>` the SE cell sits **between** the B-column estimate and the CI cell — pushing the char-gap from the estimate to the bracket to 37 chars, past the 30-char cap W0d applies to *bare* (unlabeled) brackets. The recovery therefore fired in the `DISABLE_CAMELOT` unstructured-table channel (where the CI is on the same text line, close) but **silently missed every negative B-coefficient in the Camelot HTML-table channel — the production default.** The existing test passed only because it set `DOCPLUCK_DISABLE_CAMELOT=1` (a channel blind spot; memory `glyph-fixes-need-all-three-text-channels`).
+**The fix.** Inside an HTML table row the columns pair by table geometry, not prose adjacency — the CI column belongs to the estimate column of the same row regardless of the intervening SE cell. So a bare bracket in a `<tr>` now uses the same relaxed proximity as a *labeled* CI, still guarded by `_INDEPENDENT_STAT_BETWEEN_RE` (which rejects pairing back across a *different* estimate's column). A prose text line (no `<td>`) keeps the strict 30-char cap, so the majumder prose false-positive stays blocked. All 27 efendic cells recover; the genuine positive `2.56` (Direction row, `2.56 ∈ [2.42, 2.69]`) is correctly left by the containment invariant.
+**Also fixed (pre-existing, leave-nothing-behind).** The independent-stat guard now runs for **every** bracket kind, not only labeled ones. A tight-spaced prose variant `M = 5.37, SD = 2.01, t(1827)=1.83, d = 0.09 [-1.86, 0.04]` is only 25 chars from `2.01` to the bracket — within the cap — so the gap check alone let `SD = 2.01` wrongly recover to `-.01`; the CI is `d`'s and `t`/`d` intervene, so it is now rejected.
+**Verification** (ground truth = AI multimodal read of the source PDF via article-finder `reading` golds, **never** pdftotext / Camelot). efendic rendered with Camelot ON: 27 corrupt B-cells → 0; genuine `2.56` preserved; the recovery is idempotent (the render post-process applies W0d, so a second pass is a no-op). Independent Sonnet AI-gold re-verify: A1 resolved, all B-column values in Tables 2–5 exact vs gold, **0 new regressions**. Full minus-recovery suite 22 pass (incl. 6 new — multiline-HTML-row recovery, genuine-positive-preserved, prose-strict-gap, independent-stat-across guard); adversarial over-recovery battery all pass; broad normalize/render/sections/heading suite **557 pass**; table flatten 39 pass; RC-T degenerate-table (Camelot) 8 pass; a deterministic `DISABLE_CAMELOT` body-text render diff of 8 corpus papers = **0 changed** (the change is surgical — only the Camelot HTML-table branch + the specific prose FP pattern are affected). Known remaining efendic defects are queued as their own cycles and are NOT touched by this fix: `×`-as-`3` in interaction-term names (no handler yet), SE-column spurious minus, mid-Introduction affiliation leak. Triage: `docs/TRIAGE_2026-07-03_head_v2.4.101_assessment.md`.
+## [2.4.101] — 2026-07-02
+**Concurrent-session reconciliation: six in-flight table-extraction fixes landed onto ONE verified release.** `TABLE_EXTRACTION_VERSION` → `2.4.7`, `NORMALIZATION_VERSION` → `1.9.36`. Roughly five overlapping Claude Code sessions had left the working tree and six sibling branches tangled around the region-driven table-capture path (see `docs/superpowers/handoffs/HANDOFF_2026-07-01_reconcile_concurrent_table_*.md`). This release serializes them onto the committed **v2.4.100 greedy `_find_caption_for_table` + `_rescue_duplicate_starved_captions`** base.
+**Pairing architecture decision — the global-assignment refactor was REJECTED.** One session had replaced greedy pairing with an order-independent global max-token-overlap assignment (`_assign_tables_to_captions_global` / `_best_assignment_for_page`). It was empirically confirmed to **regress `test_chan_feldman_t6_prose_not_in_any_table`** (a degenerate prose grid promoted into a table) and, per the v2.4.100 record, reshuffles ~24 papers — the same net-harmful class of change the 2026-06-25 triage and memory `project_docpluck_region_driven_camelot` already recorded. Greedy + narrow rescue is kept; the good ideas from the refactor (bare-digit token exclusion) were ported onto it.
+**The six fixes:**
+1. **chandrashekar Table 3/4 side-by-side de-interleave.** A page with ≥2 captions straddling a whitespace **gutter** drives each caption's region to its OWN column (`_detect_column_gutters` / `_assign_caption_columns` / `_label_x_midpoint` / `_column_table_bottom`, isolated Camelot calls), rebuilding caption + body from that column. Table 4's `17×2` column-straddling merge → clean **9×2**; both captions gold-exact; Table 3 body de-contaminated. Gated behind a true side-by-side signature (inert on single-column / stacked pages).
+2. **efendic Table 1 (categorical) + Table 2 (detect).** Unified `detect.py`: a contiguous **aligned-row-run** (prose-robust, replaces the global column-stability fraction), **widen-aware geometry** (`_detect_geometry_widen_aware` keeps the more-columnar of the narrow/widened band), a contiguous-footnote-gap clamp, and `whitespace._is_categorical_grid` acceptance. Table 1 `3×2` → **5×3**; Table 2 `0×0` stub → **11×5** (all 8 coefficients).
+3. **efendic Table 3 running-header + Tables 2–5 caption-tail-prose strip.** `_RUNNING_HEADER_PATTERNS` gains an `Author et al. <page-number>` two-cell running-header shape (the `Efendic et al. 1179` row 0). `_drop_caption_first_row` also drops a leading single-cell **caption-continuation prose fragment** (`DV.`, `performance and their amount of planning in the subsequent trials.`), context-guarded to fire only above genuine multi-cell table structure. This is done UPSTREAM so a prose tail can't reach a `<th>` and trip `render._strip_phantom_camelot_tables`, which drops the whole table (it was silently dropping jdm_.2023.16 Table 7's real `13×10` grid).
+4. **collabra_77859 Tables 2↔3 same-page mispairing.** Bare integers excluded from the caption-overlap tokenizer (`_CAPTION_TOKEN_RE` → `[a-z]{3,}|\d+\.\d+`), so a stray `Table 2` / `Study 2` digit stops manufacturing false overlap; region-driven capture then pins Tables 2/3/4/5 correctly AND deterministically. (A reading-order visit-order tie-break was tried and **rejected** — it perturbed greedy visit order and regressed chan_feldman T6.)
+5. **cog_emo Table 8 caption-marker hint.** An absorbed `Table N.` first row (`_leading_table_caption_number`) authoritatively pins a grid to caption N, beating degenerate caption-token-overlap ties. Table 8 recovers its real **17×6** intercorrelation matrix; Table 9 its **12×8** (was a `0×0` stub).
+6. **Registered-Report major-section heading promoter** (`render.py`, `_promote_isolated_major_section_headings`) + **dropped-minus CI-upper recovery** (`normalize.recover_dropped_minus_ci_upper` wired into flatten / cell-cleaning / grid — a CI upper bound that lost its leading minus on a tight-kerned PDF, e.g. `[-0.78, -0.66]` parsed as `[-0.78, 0.67]`, is recovered by the estimate-containment invariant).
+**Verification** (ground truth = AI multimodal read of the source PDF, **never** pdftotext / Camelot, per the project rule): full unit + real-PDF table suites green (per-file, to dodge the Camelot cumulative-load flake); `test_chan_feldman_t6_prose_not_in_any_table` PASS; 904 render/section/normalize tests pass; full **101-PDF structured diff** vs v2.4.100 — Camelot verified **deterministic** on this host (two identical-code runs byte-identical, 392 tables each), so every diff is real, not flake; AI-gold re-verification on the changed papers returned **15 BETTER, 5 SAME, 0 WORSE** across a broad sample (recovered truncated/missing data, corrected caption→table pairings, correct caption-tail strips) — the one regression it surfaced (jdm_.2023.16 Table 7 render-drop) was root-caused and fixed in the same run.
+## [2.4.100] — 2026-07-01
+**bmc_med_3 duplicate-fragment pairing FIXED — a narrow, order-independent duplicate-starvation rescue in `_find_caption_for_table`'s aftermath.** `TABLE_EXTRACTION_VERSION` → `2.4.6`. Resolves the hard case explicitly deferred in v2.4.99 (both a same-page raw-text dedupe and a bbox-proximity pairing were tried and reverted there — each regressed another paper).
+**The bug.** On a page carrying ≥2 table captions, Camelot's `stream` sometimes emits two *near-identical fragments* of one table (stream + lattice both fire). The greedy `_find_caption_for_table` loop walks Camelot's tables in emission order and lets each claim the best still-free caption, so the first caption took one copy and the second caption — now the only free one — was handed the **duplicate** second copy, even though that copy is not its table. The second caption's real, distinct table then found no free caption and was dropped. **bmc_med_3 p8:** captions Table 2 + Table 3; Camelot emitted Table 2's 29×5 twice plus Table 3's real 11×6. Table 3 was given the second copy of Table 2's 29×5 and its own "Comparisons of SCE …" 11×6 grid was discarded.
+**The fix (`_rescue_duplicate_starved_captions`, extract_structured.py).** A post-pass over the greedy assignment that fires *only* on that exact signature: on a page where ≥2 captions were assigned tables with **identical `raw_text`**, the caption that overlaps the shared table best keeps it and each other ("starved") caption is reassigned to the best **unassigned** same-page table that fits it at least as well as the duplicate it was holding. Duplicates are kept as separate assignable tables (a genuinely two-caption / two-identical-grid page — bmc_med_4 — still fills both captions unchanged); only a real distinct table that greedy stranded is recovered. bmc_med_3 Table 3 → its real 11×6 (`method` gains `dup_rescue:1`), verified cell-for-cell against the AI multimodal read of the source page (SCE comparisons: Patients-with-SCE, lesion-diameter/volume IQRs, the "Classification of lesions" group-header row) — NEVER pdftotext, per the project ground-truth rule.
+**Why narrow, not a global re-assignment.** A full order-independent max-token-overlap matching per page *does* fix bmc_med_3, but a 101-PDF structured diff showed it **reshuffled ~24 papers and un-did verified token-overlap fixes** — chen / chandrashekar / ieee_access_4 / cmaj_2 swaps, and it *promoted a degenerate prose fragment to chan_feldman Table 6* (a real regression: `test_chan_feldman_t6_prose_not_in_any_table` went from pass to fail 3/3). Global re-optimization is exactly the class of change memory `project_docpluck_region_driven_camelot` and the 2026-06-25 triage recorded as net-harmful. The rescue is therefore keyed on the duplicate signature alone, so **every page without that signature keeps its greedy assignment byte-for-byte** — chan_feldman's T6 test passes 3/3, and a whole-corpus `dup_rescue`-firing scan confirms the new code path activates on only the intended page(s); every other paper runs the identical HEAD path and cannot be changed by this fix.
+**Verification** (ground truth = AI multimodal read / article-finder golds, never pdftotext): bmc_med_3 Table 3 AI-gold-verified; bmc_med_4 byte-identical (rescue correctly does not fire); `test_chan_feldman_t6_prose_not_in_any_table` 3/3 pass; whole-corpus `dup_rescue` scan shows the fix is surgical (Camelot's run-to-run non-determinism — memory `feedback_camelot_flake_cumulative_load` — defeats naive before/after fingerprint diffing, so surgicality is proven at the code-path level, not by comparing flaky shapes). Full table unit + real-PDF suite green (per-file, to dodge the Camelot cumulative-load flake).
 ## [2.4.99] — 2026-06-29
 **Region-driven table capture (DP-1/DP-2): drive Camelot with docpluck's own caption-anchored region as `table_areas` — AI-gold-verified, the first CAPTURE-PATH change in the v2.4.x table series.** `TABLE_EXTRACTION_VERSION` → `2.4.5`. Instead of running Camelot blind (`pages="all"`) and pairing tables to captions after the fact, each table caption now drives a `stream` extraction constrained to its OWN region (`extract_tables_camelot_by_region` + `_region_driven_capture` in `extract_structured.py`), so a caption gets exactly its table by construction. This recovers the **stacked / side-by-side multi-table-per-page** class that blind detection + token-overlap pairing could not separate — **efendic Tables 4+5 now split correctly** (was a 30×2 merge + empty stub), and **12 papers' previously-empty `0×0` stubs become real tables** (chandrashekar, jdm, ieee×4, bmc_med_3 T4, bmj_open_1, …).
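
Editorial note on fix 4 of the 2.4.101 entry above (bare integers excluded from the caption-overlap tokenizer). The pattern `[a-z]{3,}|\d+\.\d+` is quoted from the changelog. Everything else in this sketch is an assumption made for illustration: the lowercasing step, the set-intersection overlap score, and the names `CAPTION_TOKEN_RE`, `caption_tokens`, and `overlap`.

```python
import re

# Pattern quoted from the changelog: words of three or more letters, or decimals.
# Bare integers such as the "2" in "Table 2" or "Study 2" match neither branch.
CAPTION_TOKEN_RE = re.compile(r"[a-z]{3,}|\d+\.\d+")


def caption_tokens(text: str) -> set[str]:
    """Tokenize lowercased text (lowercasing is an assumption of this sketch)."""
    return set(CAPTION_TOKEN_RE.findall(text.lower()))


def overlap(caption: str, table_text: str) -> int:
    """Hypothetical overlap score: the number of shared tokens."""
    return len(caption_tokens(caption) & caption_tokens(table_text))
```

Under this tokenizer a caption such as `Table 2. Study 2 means` shares nothing with a grid that contains only integers, so a stray digit cannot pull the wrong table toward the caption.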

{docpluck-2.4.99 → docpluck-2.4.102}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpluck
-Version: 2.4.99
+Version: 2.4.102
 Summary: PDF, DOCX, and HTML text extraction and normalization for academic papers
 Project-URL: Homepage, https://docpluck.app
 Project-URL: Documentation, https://docpluck.app/api-docs

{docpluck-2.4.99 → docpluck-2.4.102}/TODO.md RENAMED

@@ -243,3 +243,12 @@ Add only when a real downstream consumer asks for one. YAGNI until then.
 ### ScienceArena adapter (their repo — fixed this session, verify there)
 - [ ] **Re-run the ScienceArena benchmark** with docpluck ≥ 2.4.83 installed in ITS venv. The adapter fixes (commit `de35f4a` on sciencearena `main`: pass `layout=`, read `report.footnote_texts`, strip the caption label) are logic-verified against local docpluck but NOT run end-to-end there (docpluck isn't installed in that repo). Also recommend ranking `docpluck-standard` as the primary real-document variant (Greek preserved).
+- [ ] R-0040 | bug | P1 | from cross-project-learning-reviewer 2026-06-29 | per-row n mis-bound to comparison arm (collabra.90203 T10) + dropped-minus CI upper bound (cog_emo T8 sign-flip); T10 test encodes WRONG n (see project-review R-0040)
+  - **VERIFIED-AT-HEAD 2026-06-29 (interactive session) — split into two findings:**
+  - **Part A (per-row n mis-bound, collabra.90203 T10): NOT REPRODUCIBLE — false positive.** Ran `pytest tests/test_tables_superheader_alignment_real_pdf.py::test_collabra_90203_table10_all_six_conditions_real_pdf tests/test_tables_flatten.py::TestT90203Table10` → **3 passed**. The real-PDF test already asserts the CORRECT arm split (Target arm: r=.34, n=170; Replication arm: r=.63, CI[.53,.72]) and is GREEN. n=170 is the Target arm's correct n, NOT a wrong binding; the synthetic `TestT90203Table10` table is a 3-column no-arm table where n=170 is correct by construction. The reviewer's "test encodes the WRONG n" reading is the ~26%-false-positive class (cf. R-0006: reproduce at HEAD before fixing). **No fix needed for Part A** unless a NEW failing repro is produced.
+  - **Part B (dropped-minus CI upper bound, cog_emo T8): REAL — reproduces at HEAD.** Rendering `PDFextractor/test-pdfs/apa/chan_feldman_2025_cogemo.pdf` via `extract_pdf_structured` → `flatten_table`, Table 8 emits:
+    - `2bi:  r=-0.73 CI=[-0.78, 0.67]`  ← upper bound should be **-0.66** (sibling row above correctly reads `[-0.78,-0.66]`)
+    - `2bii: r=-0.43 CI=[-0.52, 0.33]`  ← upper bound should be **-0.33**
+    The minus on the **CI upper bound** is dropped (tight-kerned U+2212 in symbol font; pdftotext drops the glyph — same class as W0g/W0h betas, but on the CI's own bound). Existing recovery (`docpluck/normalize.py`: `recover_dropped_minus_via_ci_pairing` W0g @2663, `recover_dropped_minus_via_layout` W0h @2782) recovers *coefficient* tokens proven negative by their CI bracket — it does NOT recover a minus dropped from the **bracket itself**, because it trusts the bracket.
+  - **Fix path (do via `/docpluck-iterate`, NOT a blind edit):** add a general, structurally-keyed recovery for "CI upper bound whose minus was dropped" — likely a layout-channel check (W0h-style, via the `dropped_minus_layout` LayoutDoc already threaded into the pipeline @3005/3029) OR a point-estimate-containment invariant (when the reported point estimate is negative and the parsed CI is `[neg, pos]` such that the interval is implausibly asymmetric about the estimate, the positive upper bound is the dropped-minus victim — but tune carefully so legitimate zero-straddling CIs are NOT corrupted). MUST: key on the structural signature (general, not this one PDF); verify against the AI gold under `ArticleRepository/ai_gold/` (never pdftotext); add a regression test mirroring `test_dropped_minus_layout_recovery_real_pdf.py` for the CI-bound case; run the full 26-paper baseline for no regression. ESCIcheck already works around the downstream effect (NOTE-path + `ESCIcheckapp/docs/REPLY_TO_DOCPLUCK_2026-06-26.md`).
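
Editorial note on the "point-estimate-containment invariant" option in the fix path above. The function below is one possible sketch of that idea, not the shipped `recover_dropped_minus_ci_upper`: the name `recover_ci_upper_minus`, the asymmetry test, and the threshold `min_asymmetry=3.0` are assumptions chosen for illustration and would need tuning against real golds.

```python
def recover_ci_upper_minus(
    estimate: float, ci_low: float, ci_high: float, min_asymmetry: float = 3.0
) -> tuple[float, float]:
    """Flip the sign of a positive CI upper bound when the evidence suggests
    its minus was dropped; otherwise return the interval unchanged."""
    # Only a negative estimate with a parsed [negative, positive] CI qualifies.
    if not (estimate < 0 and ci_low < 0 < ci_high):
        return ci_low, ci_high
    flipped_high = -ci_high
    # The repaired interval must still contain the estimate.
    if not (ci_low <= estimate <= flipped_high):
        return ci_low, ci_high
    lower_gap = estimate - ci_low
    upper_gap = ci_high - estimate
    # Hypothetical guard: only flip when the parsed interval is far more
    # lopsided than a plausible CI, so zero-straddling CIs are left alone.
    if upper_gap > min_asymmetry * lower_gap:
        return ci_low, flipped_high
    return ci_low, ci_high
```

For the `2bi` row quoted above (`r=-0.73`, `CI=[-0.78, 0.67]`), this sketch returns `(-0.78, -0.67)`. A roughly symmetric zero-straddling interval such as `-0.10` with `[-0.30, 0.10]` is returned unchanged.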

{docpluck-2.4.99 → docpluck-2.4.102}/docpluck/__init__.py RENAMED

@@ -78,7 +78,7 @@ from .figures import Figure
 from .extract_structured import TABLE_EXTRACTION_VERSION, StructuredResult, extract_pdf_structured
 from .render import render_pdf_to_markdown
-__version__ = "2.4.99"
+__version__ = "2.4.102"
 __author__ = "Gilad Feldman"
 __license__ = "MIT"
