PyPI - docpluck - Versions diffs - 2.4.95__tar.gz → 2.4.97__tar.gz - Mend

docpluck-2.4.95.tar.gz → docpluck-2.4.97.tar.gz


Files changed (436)

{docpluck-2.4.95 → docpluck-2.4.97}/.claude/skills/docpluck-deploy/SKILL.md RENAMED

@@ -84,28 +84,29 @@ print(f'All imports OK; docpluck=={info[\"version\"]} normalize={info[\"normaliz
 "
 ```
-### 4. Cross-Repo Library Version Sync (CRITICAL)
+### 4. Cross-Repo Library Version Sync (CRITICAL — "when we bump the package, we bump the app")
-Verify the app's `service/requirements.txt` git pin matches the library's latest tag. Mismatches mean the deploy will silently ship the OLD library to prod.
+Verify the app's `service/requirements.txt` git pin matches the library's latest released tag. A mismatch means the deploy silently ships the OLD library to prod. **The pin is read from docpluckapp `origin/master` (what Railway deploys), NOT the local clone. A stale local checkout shows an old pin even when prod is correctly synced, which nearly caused a phantom "fix".** The shared gate (also run by `/docpluck-qa` check 11b and `/docpluck-review` rule 22) is the single source of truth:
 ```bash
-LIB_VERSION=$(grep '^__version__' C:/Users/filin/Dropbox/Vibe/MetaScienceTools/docpluck/docpluck/__init__.py | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')
-APP_PIN=$(grep -oE 'docpluck.*@v[0-9]+\.[0-9]+\.[0-9]+' C:/Users/filin/Dropbox/Vibe/MetaScienceTools/PDFextractor/service/requirements.txt | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')
-echo "Library __version__: $LIB_VERSION"
-echo "App requirements.txt pin: v$APP_PIN"
-if [ "$LIB_VERSION" != "$APP_PIN" ]; then
-  echo "❌ MISMATCH — bump PDFextractor/service/requirements.txt to docpluck @ git+https://github.com/giladfeldman/docpluck.git@v$LIB_VERSION before deploying"
+cd C:/Users/filin/Dropbox/Vibe/MetaScienceTools/docpluck && python scripts/check_app_pin_sync.py || {
+  echo "Cross-repo pin sync FAILED — recover before deploying:"
+  echo "  - re-push the tag: git push origin v<VERSION>  (re-fires bump-app-pin.yml), OR"
+  echo "  - hand-bump PDFextractor/service/requirements.txt to @v<VERSION> and push to docpluckapp master."
   exit 1
-fi
+}
+# Note: a working-tree __version__ ahead of the latest tag is reported UNRELEASED (not a failure) —
+# that is the normal pre-flight state; the "Library Release Step" below tags+pushes it, which
+# fires the auto-bump, and post-deploy check 3 confirms Railway /health reports the new version.
 # Also verify the API.md examples are not stale beyond a major version
+LIB_VERSION=$(grep '^__version__' C:/Users/filin/Dropbox/Vibe/MetaScienceTools/docpluck/docpluck/__init__.py | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')
 API_DOC_VERSION=$(grep -oE 'docpluck_version["[:space:]:]+[0-9]+\.[0-9]+\.[0-9]+' C:/Users/filin/Dropbox/Vibe/MetaScienceTools/PDFextractor/API.md | head -1 | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')
 LIB_MAJOR_MINOR=$(echo "$LIB_VERSION" | cut -d. -f1,2)
 DOC_MAJOR_MINOR=$(echo "$API_DOC_VERSION" | cut -d. -f1,2)
 if [ "$LIB_MAJOR_MINOR" != "$DOC_MAJOR_MINOR" ]; then
-  echo "⚠️ API.md examples reference docpluck_version $API_DOC_VERSION; library is at $LIB_VERSION. Update PDFextractor/API.md."
+  echo "WARN: API.md examples reference docpluck_version $API_DOC_VERSION; library is at $LIB_VERSION. Update PDFextractor/API.md."
 fi
-echo "✅ Library version sync OK"
 ```
 ### 5. Verify Vercel Environment Variables

{docpluck-2.4.95 → docpluck-2.4.97}/.claude/skills/docpluck-iterate/LEARNINGS.md RENAMED

@@ -1275,3 +1275,55 @@ aren't skipped.
 2. **A bounded sample gives FALSE CONFIDENCE — the full-corpus regression gate (rule 19) is non-negotiable and earned its keep here.** The 11-paper diff said "only the target changed, ship it"; the 48-paper diff revealed 4 real-heading false positives. I nearly shipped a regression off a green bounded sample. ALWAYS run the guard-live-vs-bypassed diff over the WHOLE corpus before trusting a heading-promotion change — bounded samples miss the long tail where the FP pattern lives.
 **Open queue (run stays OPEN — standing verdict FAIL):** the cell-label cases are table-content-as-prose → fold into the table cluster, NOT a standalone render guard; the TABLE cluster (highest impact, architectural bbox decision outstanding since 2026-05-22 — needs user scope decision); RC-1 band path (multi-session, riskiest); residual metadata-leaks. The clean render-layer slice that DID ship this run (affiliation v2.4.92) plus the prior single-column v2.4.91 are done; the remainder is architectural.
+---
+## 2026-06-21 · Resume · cycle 3 · full real AI-verify @ v2.4.95 = 7/7 canary FAIL; corpus at an ARCHITECTURAL boundary (3 root causes, all need sign-off); 2 open_findings adjudicated NOT-defects
+**Target:** "keep addressing todo, iterating and improving." Found the run half-open after the v2.4.95 Request-11 ship (cycles 1-2), TRIAGE stale @ v2.4.88, 2 carried-over open_findings. **Verdict: cycle 3 = VERIFY + ADJUDICATE + SURFACE-DECISION, NO code shipped.** Standing FAIL.
+**The canary-audit clobber masked a fully-broken corpus AGAIN (memory `feedback_canary_audit_clobbers_phase5d`, re-confirmed live).** The fresh HEAD canary render (`canary-2dbdd98`) carried 5 `verdict:PASS` files — but every one was `raw_verdicts:[AUDIT_DEFERRED_TO_AGENT,AUDIT_DEFERRED_TO_AGENT] → union PASS`. `AUDIT_DEFERRED` means the headless Sonnet deferred to the in-session agent and the hook recorded the *non-verdict* as PASS. Re-running 7 real in-session Sonnet verifiers vs the article-finder golds → **7/7 FAIL.** **Never trust a canary-audit PASS whose `raw_verdicts` are `AUDIT_DEFERRED`; it is a placeholder, not a verification.** Re-verify manually after a commit before trusting I3-green.
+**Narrow-scope verification hides broad defects (Phase-0.8 cross-output lesson, re-proven).** maier_2023_collabra was recorded PASS in cycles 1-2 — but those only checked the *specific* Request-11 flatten fields (T8/T10). The full-document verify this cycle shows maier broadly FAIL: T5 unstructured fallback, T7 garbled with body text, T8/T9/T11 empty headers, section displacement. The narrow check wasn't wrong for its scope; it never looked at the rest of the doc. **A per-feature "PASS" is not a per-document PASS — when the canary rotation comes around, verify the WHOLE document, not the fields the last cycle touched.**
+**All 7 papers' defects cluster to 3 ARCHITECTURAL root causes (TRIAGE_2026-06-21):**
+1. **RC-T table-bbox** (widest — all 7, single + two-column): Camelot grabs furniture/adjacent prose → empty shells, garbled cells, missing headers, duplicate dumps, orphan `### Table N`. Proof: ip_feldman Table-10 cells = running-header `Ip and Feldman` + page `15` + `Discussion` heading + Discussion prose + 1 real row, all bbox `(0,0,0,0)` → render *correctly* drops the `<table>`. Bbox decision open since 2026-05-22.
+2. **RC-1 column/sidebar interleave** (chan_feldman, chandrashekar, plos_med sidebar): the furniture-strip is *defeated by* the interleave (plos_med's whole front-matter sidebar lands before the Abstract; `## Abstract Published: <date>` weld). Spec ready (2026-06-08 region-aware).
+3. **RC-B7 deleted-minus glyph** (ar_apa): 5 body-prose betas sign-flipped `β=−.022`→`b=.022`. 3-path decision pending.
+**No clean non-architectural win remains — chased the 3 best leads, each bottomed out in a root cause above** (plos_med masthead-strip = interleave; ip_feldman Table-10 splice = bbox-garbage; abstract-date weld = interleave). Per the skill (avoid C4 without sign-off) + must-stop ("fix needs an architectural decision"), surfaced the 3-way decision to the user rather than diving in blind.
+**2 open_findings adjudicated NOT docpluck defects (LEAVE NOTHING BEHIND — inspected locally, did not defer):**
+- `collabra_77859` "Table 3" vs gold "Table 2": source text-channel caption (line 866) is verbatim `Table 3. Study 4: Dish sets`; docpluck correct, **gold mis-numbered** → article-finder.
+- `collabra_90203` Table 10 r=.59 vs .63: pdftotext literally emits `.59` (text-line 1706), Camelot agrees; `.63` is visual-only → source text-layer/visual divergence, OCR-only. Documented limitation.
+**Addendum (same session, user authorized "do all three, 1-3"): RC-B7 was ALREADY DONE; RC-T root-caused; checkpointed before the big multi-session table work.**
+**RC-B7 (authorized #1) = already implemented — verify the codebase before re-solving an "architectural" item.** I set out to build the B7 layout-channel minus recovery the old TRIAGE called for — and found it already exists as **W0h** (`normalize.recover_dropped_minus_via_layout`), wired (render.py:5079→sections→normalize.py:3170) and regression-tested (`tests/test_dropped_minus_layout_recovery_real_pdf.py`). HEAD renders `b=-.022 / -.88 / -.428` correctly (4/5 ar_apa betas). My own cycle-3 FAIL was a **verifier over-flag** — it quoted the W0h-recovered `-.022`/`-.428` and still called them GLYPH defects. **Lesson: before treating a TRIAGE "architectural, needs sign-off" item as open, grep the library for an existing implementation — a prior session may have already shipped it. The verifier's FAIL is a hypothesis, not a fact; reproduce + read the code at HEAD first.** Residuals (`.245` pixel-minus, β→b) confirmed OCR-tier: probed BOTH channels — pdfplumber also extracts `b` (font AdvPSMP10) and shows no `.245` minus glyph; outside docpluck's MIT text+layout architecture, already documented in the W0h comment.
+**RC-T (authorized #2) root cause = the FULL-PAGE-BBOX signature.** ip_feldman Table 10's region bbox is `(53, 53, 577, 800)` — top→bottom spans the whole of page 15 (vs Tables 1-9 = tight sub-region bands), so the "table" swallowed the running header `Ip and Feldman` + page `15` + `Discussion` heading + Discussion prose + 1 real data row (cells all bbox `(0,0,0,0)`). **The fix must key on CELL CONTENT (furniture/prose signatures), NOT bbox-size** — legitimate landscape Tables 6/7/8 also have tall bboxes; a degenerate region → clean unstructured fallback (no orphan `### Table N`, no prose-as-cells). This is a multi-session, high-regression-surface change.
+**Pacing decision (per LEARNINGS rule: full-corpus gate is non-negotiable; don't rush heading/table changes at session-tail).** After a long verification+investigation session, I deliberately did NOT start the RC-T/RC-1 implementation — a table-bbox change rushed without a careful full-corpus regression pass is exactly how the cycle-3 caption-follows revert happened. Checkpointed with the characterization done so RC-T can be a focused dedicated effort. Standing verdict stays FAIL (RC-T + RC-1 open).
+---
+## Run: 2026-06-21 (PM) · cycle 1 · RC-T Option A implemented → v2.4.96
+### Outcome
+- **SHIPPED (incremental):** RC-T degenerate prose-table strip — `render.py::_strip_phantom_camelot_tables`. Net corpus impact: **exactly 2 tables** now fail-clean (maier_2023 Table 7, chan_feldman Table 6), all else byte-identical. ip_feldman Table 10 (the RC-T canonical "orphan") was **already** fail-clean at HEAD; the real fixable garbage was elsewhere. Standing verdict remains **PARTIAL/FAIL** — RC-T Layer-1 recovery + RC-1 interleave are deferred (user-scoped to Option A), so canaries still FAIL on those classes.
+### Blind Spots (corrections to the prior session's RC-T characterization)
+- **The prior entry framed RC-T as "full-page-bbox → emit a cell-content degenerate guard / route to unstructured fallback / no orphan `### Table N`."** Reproduction at HEAD corrected this: ip_feldman T10 is **already stripped** at HEAD (its folded-prose `<th>` already trips `_strip_phantom_camelot_tables` via the fn≥3 path), rendering `### Table 10` + caption + no table = already fail-clean. The actual open gap was much smaller: **"the" — the single most common English function word — was missing from `_FUNCTION_WORDS_IN_PROSE`**, so a body-prose `<th>` with `fn==2 (+the)` slipped under the bar (maier T7 "Following the analyses conducted in Study 1 of Small": fn=in,of=2; verb=following,conducted=2). Lesson: a "build a whole new guard" item can collapse to a one-word set fix once you reproduce + read the EXISTING guard at HEAD (compare RC-B7's "already implemented as W0h").
+- **Cell-content "majority prose/furniture" is NOT a safe degenerate discriminator** — my first attempt (a data-layer `_table_cells_are_degenerate` keyed on majority prose) FP'd immediately on chan_feldman Table 5, a **legitimate comparison table whose cells are full descriptive sentences by design** (prose-in-cells is normal for comparison/design/instrument tables). Reverted. The reliable signature is narrower: a **≥8-word sentence-shaped `<th>`** (real column headers are short noun phrases; a sentence in the header == Camelot folded body prose). Legit prose-bearing tables keep short headers.
+### Edge Cases (the title-leak FP — caught only by the full-corpus scan)
+- **A real table can leak its own TITLE into the `<th>`** (Camelot whole-page bbox grabs the caption row as the header). `aom/amp_1` Table 5's `<th>` is its caption "Improving Scholarly Impact … Practice" over a REAL grid (Domains/Policymaking/Practice + data). Adding "the" naively strips this real table = regression. **Discriminator: caption-token overlap.** A title-leak `<th>` shares ≥60% of its tokens with a caption line; a body-prose `<th>` (maier T7, chan_feldman T6) shares ~none. Gate the new strip on `not is_title_leak`.
+- **A relaxation can secretly be a broadening.** My first title-leak exclusion skipped title-leak `<th>`s for ALL fn-paths — which would have changed the verdict for **37 tables already stripped at HEAD** (many with title-shaped `<th>`s like jama-open-1 T3 "Effect of Time-Restricted Eating…", bmc-pub-health-2 "Table 2 Collection of equations…"). Whether each of those 37 is correct-strip vs wrongly-stripped-real-table is a **pre-existing, unverified question**. Fix: **scope the new behavior to the marginal case only** — fire only when "the" crosses fn 2→3 (`fn_count<3 and fn_count+the_count>=3`), so every HEAD-stripped table is byte-identical and only genuinely-new body-prose strips are added.
+### Improvements (verification methodology)
+- **th-level full-corpus FP scan beats a bounded render sample AND a full render-diff for a surgical guard change.** `tmp/repro/fp_scan.py` extracted all 152 corpus PDFs once and, per table, computed the exact `_strip_phantom` th counts with/without "the" — pinpointing the precise change set (3 fn-flip tables; 37 HEAD-stripped candidates for un-strip regression) in ~13 min, far cheaper than 304 renders. For a change confined to one code path, analyze THAT path corpus-wide rather than diffing whole renders. (Still confirmed the deciding cases with real renders.)
+- **The full-corpus gate earned its keep AGAIN.** The 4-paper sample (ip_feldman/maier/chan_feldman/plos_med) looked clean at every step; the FP (amp_1 T5) and the 37-table over-broad-exclusion lived ONLY in the long tail — exactly the cycle-3-caption-follows trap. Never trust a bounded sample for a table/heading guard.
+### Verification Gaps / Deferred (queued, not dropped)
+- **RC-T Layer-1 recovery** — actually RECOVERING lost table data via tight `table_areas` (plos_med T5's 13 SAE rows; chan_feldman T2 column-squish) — out of Option-A scope, deferred by the user.
+- **Audit of the ~37 `_strip_phantom` th-stripped tables** — some title-shaped `<th>` strips may be wrongly-stripped REAL tables (pre-existing, predates this cycle). Needs its own verification cycle (render each, judge real-vs-phantom). Surfaced explicitly; NOT silently shipped around.
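The v2.4.96 entries above describe two gates on the new strip: fire only when "the" lifts a `<th>` from fn 2 to fn 3, and skip any `<th>` that shares at least 60% of its tokens with a caption line. The following is a minimal sketch of that logic, not the code in `render.py`. The function-word subset, the helper names (`tokens`, `caption_overlap`, `newly_stripped_by_the`), and the tokenizer are all assumptions for illustration, and the real guard's verb-shape test is omitted.

```python
import re

# Hypothetical subset; the real _FUNCTION_WORDS_IN_PROSE set is not shown in the diff.
FUNCTION_WORDS = {"in", "of", "and", "to", "for", "with", "on"}


def tokens(text):
    """Lowercase alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def caption_overlap(th_text, caption):
    """Fraction of the <th>'s distinct tokens that also occur in the caption."""
    th = set(tokens(th_text))
    if not th:
        return 0.0
    return len(th & set(tokens(caption))) / len(th)


def newly_stripped_by_the(th_text, captions, overlap_threshold=0.6):
    """True only for the marginal case: 'the' carries fn from below 3 to 3 or more,
    the header is sentence-length (>= 8 words), and it is not a title leak."""
    words = tokens(th_text)
    if len(words) < 8:
        return False
    fn_count = sum(w in FUNCTION_WORDS for w in words)
    the_count = words.count("the")
    marginal = fn_count < 3 and fn_count + the_count >= 3
    is_title_leak = any(
        caption_overlap(th_text, c) >= overlap_threshold for c in captions
    )
    return marginal and not is_title_leak
```

Because the `fn_count < 3` condition excludes every header that already crossed the bar without "the", a sketch like this cannot change the verdict for tables stripped at HEAD, which is the byte-identical property the entry calls for.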

{docpluck-2.4.95 → docpluck-2.4.97}/.claude/skills/docpluck-qa/SKILL.md RENAMED

@@ -25,7 +25,7 @@ If QA surfaces an issue — any issue, however small, whether pre-existing, alre
 ## Project Context
 - **App repo (private):** `C:\Users\filin\Dropbox\Vibe\MetaScienceTools\PDFextractor` (GitHub: giladfeldman/docpluckapp)
-- **Library repo (public):** `C:\Users\filin\Dropbox\Vibe\docpluck` (GitHub: giladfeldman/docpluck, PyPI: docpluck)
+- **Library repo (public):** `C:\Users\filin\Dropbox\Vibe\MetaScienceTools\docpluck` (GitHub: giladfeldman/docpluck, PyPI: docpluck)
 - **Frontend:** Next.js 16 + Auth.js + Drizzle (in `frontend/`), port 6116
 - **Service:** Python FastAPI importing `docpluck` library (in `service/`), port 6117
 - **Database:** Neon Postgres (docpluck project)
@@ -607,6 +607,18 @@ print('F0 sentinel (preserved): PASS')
 "
 ```
+### 11b. Cross-Repo Library ↔ App Version Sync (CRITICAL — "when we bump the package, we bump the app")
+Asserts the app's docpluck git pin equals the library's latest released tag, so production never silently runs an old library. Reads the pin from docpluckapp **origin/master** (production-authoritative — a stale local clone shows an old pin even when prod is synced), so it fetches first.
+```bash
+cd C:/Users/filin/Dropbox/Vibe/MetaScienceTools/docpluck && python scripts/check_app_pin_sync.py
+```
+**Gate:** exit 0 (`PASS: in sync ...`). Exit 1 = the app pin lags the latest library tag — the `bump-app-pin.yml` auto-bump missed; recover NOW (re-push the tag to re-fire the workflow, or hand-bump `PDFextractor/service/requirements.txt` to `@v<latest>` and push to docpluckapp `master`), then re-run until PASS — never report the run clean while it lags. Exit 2 = could not reach docpluckapp `origin/master` (treat as FAIL, not PASS; offline dev only may pass `--allow-local-fallback`, which reads the local clone and prints a stale-clone warning). A working-tree `__version__` ahead of the latest tag is reported as UNRELEASED (not a failure) — tag + push it so the app auto-bumps.
+See CLAUDE.md "Two-Repo Architecture → Library ↔ app version sync".
 ### 12. Production Deployment (Vercel + Railway)
 ```bash
 # Vercel frontend
@@ -658,10 +670,11 @@ Opt-in cross-format benchmark suite --- DOCX corpus integrity, DOCX↔PDF parity
 | 9 | Database connectivity | PASS/FAIL | 7/7 tables |
 | 10 | Admin API | PASS/FAIL | health + stats |
 | 11 | Hard rules (4+2 checks) | PASS/FAIL | no -layout, no AGPL, U+2212, version, new modules, F0 sentinel |
+| 11b | Cross-repo lib↔app version sync | PASS/FAIL | app pin (origin/master) == latest library tag |
 | 12 | Production health | PASS/FAIL | HTTP codes |
 | 13 | ESCIcheck 10-PDF (production) | PASS/FAIL/SKIP | X/10 passed |
-**Overall: X/15 checks passed**
+**Overall: X/16 checks passed**
 ### Issues Found
 - [list any failures with exact error messages and file:line]
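Check 11b above assigns a distinct meaning to each exit code of `scripts/check_app_pin_sync.py`. A minimal sketch of how a QA runner might translate those codes into the PASS/FAIL column of the results table follows. The helper names `interpret_pin_check` and `run_pin_check` are hypothetical; only the script path and the exit-code meanings come from the skill text.

```python
import subprocess
import sys

# Exit-code meanings as described in check 11b; the wording of the messages is illustrative.
VERDICTS = {
    0: ("PASS", "app pin equals the latest library tag"),
    1: ("FAIL", "app pin lags the latest library tag; re-push the tag or hand-bump the pin"),
    2: ("FAIL", "could not reach docpluckapp origin/master; treat as FAIL, not PASS"),
}


def interpret_pin_check(returncode):
    """Map an exit code to (verdict, explanation); unknown codes fail closed."""
    return VERDICTS.get(returncode, ("FAIL", f"unexpected exit code {returncode}"))


def run_pin_check(repo_dir):
    """Run the gate from the library repo root and interpret its exit code."""
    proc = subprocess.run(
        [sys.executable, "scripts/check_app_pin_sync.py"], cwd=repo_dir
    )
    return interpret_pin_check(proc.returncode)
```

Failing closed on any unlisted exit code matches the check's stance that an unreachable remote is never reported as a pass.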

{docpluck-2.4.95 → docpluck-2.4.97}/.claude/skills/docpluck-qa/references/check-11-hard-rules.md RENAMED

@@ -3,7 +3,7 @@
 _Extracted from [../SKILL.md](../SKILL.md). Full procedure lives here._
 ```bash
-cd C:\Users\filin\Dropbox\Vibe\PDFextractor\service && python -c "
+cd C:/Users/filin/Dropbox/Vibe/MetaScienceTools/PDFextractor/service && python -c "
 import re
 # Rule 1: No -layout flag in pdftotext calls (check library)
@@ -30,15 +30,22 @@ print('Rule 2 (no AGPL): PASS')
 import docpluck.normalize as _nm_mod
 with open(_nm_mod.__file__, 'rb') as _f:
     _norm_bytes = _f.read()
-assert b'\\u2212' in _norm_bytes or b'\xe2\x88\x92' in _norm_bytes, 'U+2212 normalization missing'
+assert b'\xe2\x88\x92' in _norm_bytes, 'U+2212 normalization missing'  # U+2212 as UTF-8 bytes
 print('Rule 3 (U+2212 norm): PASS')
-# Rule 4: Library version is consistent
-import docpluck
-assert docpluck.__version__ == '1.4.5', f'Version mismatch: {docpluck.__version__}'
-print(f'Rule 4 (version=1.4.5): PASS')
+# Rule 4: Library version is internally consistent (__init__.py == pyproject.toml).
+# (Do NOT freeze a literal version here — it rots. This checks the two in-repo
+#  sources agree; the cross-repo app-pin sync is qa check 11b, see note below.)
+import docpluck, re, pathlib
+init_ver = docpluck.__version__
+pyproject = (pathlib.Path(docpluck.__file__).resolve().parent.parent / 'pyproject.toml').read_text(encoding='utf-8')
+proj_ver = re.search(r'(?m)^version\s*=\s*.(\d+\.\d+\.\d+)', pyproject).group(1)
+assert init_ver == proj_ver, f'Version mismatch: __init__={init_ver} pyproject={proj_ver}'
+print(f'Rule 4 (version consistency, __init__==pyproject=={init_ver}): PASS')
 "
 ```
+> **Cross-repo pin sync** (the app's `@v<VERSION>` pin == the library's latest released tag) is a *separate* gate — qa check **11b** and the review hard rule, both via `python scripts/check_app_pin_sync.py` (reads docpluckapp `origin/master`). Rule 4 above only checks the library's *internal* version consistency (`__init__.py` == `pyproject.toml`).
 ---

{docpluck-2.4.95 → docpluck-2.4.97}/.claude/skills/docpluck-review/SKILL.md RENAMED

@@ -213,6 +213,12 @@ The product principle is "one nightly server-resource-bounded cron per concern."
 - **Check:** parse `frontend/vercel.json`; assert exactly two cron entries, namely `/api/admin/blob-cleanup` at `0 3 * * *` AND `/api/cron/daily-digest` at `0 9 * * *`.
 - **Severity:** BLOCKER on any third cron — REQUEST CHANGES with the question "why isn't daily-digest enough?"
+### 22. Library ↔ app version pin sync (cross-repo, 2026-06-20 — "when we bump the package, we bump the app")
+The app imports the library via a git pin in `PDFextractor/service/requirements.txt` (`docpluck[all] @ git+...@v<VERSION>`). This pin — on docpluckapp **origin/master**, the production-authoritative source Railway deploys — MUST always equal the library's latest released `v*` tag. A lagging pin silently runs the OLD library in production. The `bump-app-pin.yml` workflow auto-bumps it on tag push, but it is best-effort and has missed silently, so review VERIFIES rather than assumes.
+- **Check:** run `python scripts/check_app_pin_sync.py` (from the docpluck repo). Exit 0 = synced. It fetches docpluckapp `origin/master` and compares the pin to the latest library tag — a stale local clone is NOT trusted (it shows an old pin even when prod is correctly synced).
+- **Check:** whenever the diff bumps `docpluck/__init__.py:__version__` / `pyproject.toml:version`, confirm a matching `v<VERSION>` tag will be / has been pushed so the auto-bump can fire — an untagged version bump leaves the app pinned behind. The script reports this as UNRELEASED.
+- **Severity:** BLOCKER if the app pin lags the latest released library tag (production runs the old library). WARN if a working-tree version bump is not yet tagged (release step pending).
 ## Review Checklist
 ### Python Service (`service/`)
@@ -266,6 +272,7 @@ The product principle is "one nightly server-resource-bounded cron per concern."
 - [ ] Changes reflected in CLAUDE.md if architectural
 - [ ] LESSONS.md updated if a new pitfall was discovered
 - [ ] TODO.md updated if features were added/completed
+- [ ] App pin in sync with library (hard rule 22): `python scripts/check_app_pin_sync.py` exits 0 (app pin on docpluckapp `origin/master` == latest library tag). BLOCKER if it lags.
 ## Output Format

{docpluck-2.4.95 → docpluck-2.4.97}/CHANGELOG.md RENAMED

@@ -1,5 +1,31 @@
 # Changelog
+## [2.4.97] — 2026-06-22
+**Three table fixes shipped together (combined from two concurrent sessions): type the skipped p+df columns (DP-2), stop dropping / mis-binding two-header-row tables (DP-5), and stop the table raw_text fallback swallowing body prose (RC-T Layer-2).** `TABLE_EXTRACTION_VERSION` → `2.4.2`; no `NORMALIZATION_VERSION` / `SECTIONING_VERSION` change. DP-2/DP-5 are render-visible in the inline flattened-table blocks + the `.tables.jsonl` sidecar `fields` (the `<table>` HTML gains the previously-dropped data rows); RC-T Layer-2 is render-visible in the `unstructured-table` fallback blocks. DP-2/DP-5 filed in `ESCIcheckapp/docs/DOCPLUCK_HANDOFF_2026-06-21.md`; RC-T Layer-2 per `docs/superpowers/specs/2026-06-21-rc-t-table-region-prose-contamination.md`.
+- **DP-2 — type the unlabeled p and df columns.** `tables.flatten._recover_blank_roles` recovered the leading test statistic and the `d [CI]` column of a header-stripped result table but left the bare p-value and df columns between them untyped, so `collabra.77859` Table 3 emitted `fields: {group, t, d, CI}` and dropped the `p` (`.551`) and `df` (`260.54`). A new Pass 4.5 types a still-blank column that is a bare `.XXX` with no comparison op as `p`, and a bare integer / Welch-decimal sitting between the test statistic and its `est/CI` column as `df` — keyed on data shape + position relative to the already-recovered roles, never bare position. The four Table-3 rows now carry `p` and `df`.
+- **DP-5 — two-row-header parallel-arm tables: recover the first data row and align centered super-headers.** `collabra.90203` Table 10 delivered only 5 of its 6 correlation rows (the Identifiable/Explicit-learning row was silently dropped), and the Original/Replication arms of `xiao_2021` Table 4 were swapped. Three coupled root-cause fixes: (a) `cell_cleaning._is_header_like_row` now counts APA value shapes (leading-dot decimal, bracketed CI, operator-prefixed p, `N/A`) as data via `_DATA_VALUE_CELL_RE`, so a real first data row is no longer mis-read as a third header row (the bracket branch requires a digit and no letters inside, so a genuine `[95% CI]` header stays a header); (b) `tables.flatten._detect_column_groups` re-derives arm boundaries from equal-width blocks of the data region — each must contain exactly one super-label — so a *centered* super-label (camelot stream loses colspan and folds it mid-span) no longer swaps arm values or pushes a stat column into the label region; left-aligned super-headers stay byte-identical; (c) `tables.flatten._classify_column` reads a folded super-header cell's role from its sub-part so a folded `…<sep>95% CI` column is still typed `CI`. Table 10 now emits all 6 conditions split into Target-article / Replication arms with correct `r` / `n` / `CI` / `p`; xiao Table 4 arms are no longer swapped; incidentally recovers `chan_feldman` Table 8 arm labels and `jama_open_2` Table 3 HR estimates + CIs.
+- **RC-T Layer-2 — stop the table raw_text fallback swallowing body prose.** When Camelot recovers no cells, `extract_structured._extract_table_body_text` linearises the text after a caption as the `unstructured-table` fallback; its per-line prose gate (`_line_is_body_prose`, len ≥ 80) misses prose that pdftotext WRAPPED into short (~48-char) lines, so a short table's caption-anchored region overshot the table end and swallowed Results/Discussion prose. Two FP-safe structural fixes: **(a) Note-anchor** — a table's `Note:` footnote is, by convention, its last element, so trim everything after the note paragraph (`chan_feldman` T1/T3 + `efendic_2022` T5 trailing Discussion prose removed; the stat rows + the note are kept); **(b) degenerate-prose guard** — suppress a fallback block that STARTS mid-sentence with a lowercase multi-letter word AND is majority sentence-shaped prose, so the renderer emits a clean caption-only table (`chan_feldman` T9 was an entire verbatim duplicate of `## Discussion` — now caption-only, no duplication). FP-safe by construction: real table cells start with a header / label / number / single-letter item marker, never a wrapped mid-sentence continuation — hypotheses ("a There is a positive association…"), descriptive rows ("Median age"), and instrument fragments are preserved. Keyed on the structural overshoot signature, never paper identity.
+Verification: new real-PDF + contract regression tests (`tests/test_tables_superheader_alignment_real_pdf.py`) — collabra.90203 T10 six-conditions/correct-arms + xiao T4 not-swapped (each FAILS at HEAD, PASSES after), plus `_is_header_like_row` / `_detect_column_groups` contract cases; `tests/test_tables_flatten_blank_header_recovery.py` extended for DP-2. A full-corpus (101-PDF) cached-table flatten diff confirms no clean-table regression — every changed table is a recovered row, a correct arm split, a recovered field, or a removed stat-less spurious row; already-garbage tables shuffle without a clean table regressing. Broad pytest green (real-PDF Camelot tests run serially per file — non-deterministic under cumulative load). RC-T Layer-2 adds `tests/test_rc_t_layer2_raw_text_real_pdf.py` (6 contract + 4 real-PDF: chan T1 Note-anchor, T9 suppress-no-duplication, T3 preserved) and an independent full-corpus 101-PDF guard-live-vs-bypassed raw_text diff (`grew=0 changed=0`; 4 trims + 8 prose-suppressions only). A 7-canary Sonnet AI-gold verify confirms every table this release touched is correct (chan T1/T3/T9, maier T10 six-conditions, xiao T4 arms) with no new TEXT-LOSS / HALLUCINATION.
+**Deferred (pre-existing, user decision 2026-06-22):** the remaining canary AI-verify FAILs are the architectural backlog, NOT regressions from this release — RC-T **Layer-1** table-data recovery (`table_areas`; e.g. plos_med Table 5's SAE rows, chan_feldman / chandrashekar under-extraction) and RC-1 two-column / sidebar column-interleave. Tracked in `docs/TRIAGE_2026-06-21_head_v2.4.95_assessment.md`; intentionally not addressed here.
+## [2.4.96] — 2026-06-21
+**RC-T (Option A): strip Camelot "tables" that are absorbed body prose, not data.** Render-only — `render.py::_strip_phantom_camelot_tables`; no `TABLE_EXTRACTION_VERSION` / `NORMALIZATION_VERSION` / `SECTIONING_VERSION` change.
+Camelot runs free-form (`flavor="stream"`, no `table_areas`), so on a text-heavy page it returns a whole-page bbox and folds body prose into the `<thead>`. `_strip_phantom_camelot_tables` already drops such a table when a `<th>` is sentence-shaped prose (≥8 words, ≥3 function words, ≥2 verb-shape words) — but its function-word set was **missing `"the"`**, the single most common English function word, so a body-prose `<th>` with `fn=2 + "the"` slipped under the `fn≥3` bar and a garbage prose `<table>` survived. Two corpus cases: `10.1525/collabra.90203` (maier) Table 7 (`<th>` "Following the analyses conducted in Study 1 of Small") and `10.1080/02699931.2024.2434156` (chan_feldman) Table 6 ("associations between the six measures of interest: …").
+The fix counts `"the"`, but is **scoped to the marginal case** (a `<th>` that crosses fn 2→3 only because of "the") so every table already stripped at HEAD via the `fn≥3` path stays byte-identical, and is **gated on NOT being a title-leak**: `aom/amp_1` Table 5 leaks its own caption ("Improving Scholarly Impact … Practice") into the `<th>` over a REAL grid (Domains / Policymaking / Practice + data) — a caption-token-overlap test (≥60%) keeps such title-leak tables. **Net corpus impact: exactly 2 tables stripped, all others byte-identical** (verified by a full-corpus th-level scan over all 152 corpus PDFs + render diffs). Fail-clean: the `### Table N` heading + caption remain (table_parity preserved); the stripped prose is a Camelot duplicate, so the clean original survives in the body (no TEXT-LOSS). ip_feldman Table 10 — the RC-T canonical "orphan" — was already fail-clean at HEAD and is unchanged.
+Verification: 8 new real-PDF regression tests (`tests/test_rc_t_degenerate_table_real_pdf.py`): maier T7 + chan_feldman T6 stripped (each FAILS at HEAD, PASSES after), amp_1 T5 + chan_feldman T2/T5 (real tables) preserved, maier prose survives in body (no TEXT-LOSS), ip_feldman T10 prose stays out of `<table>`. Broad pytest green; 7-canary AI-verify shows the touched tables fail-clean with no new TEXT-LOSS / HALLUCINATION.
+**Deferred (separate cycles, surfaced not dropped):** (1) RC-T **Layer-1 recovery** — actually RECOVERING lost table data via tight `table_areas` (plos_med Table 5's 13 SAE rows; chan_feldman Table 2 column-squish) — out of Option-A scope. (2) **Audit of the ~37 corpus tables** stripped by the existing `_strip_phantom_camelot_tables` th-prose / section-token paths — some title-shaped `<th>` strips may be wrongly-stripped REAL tables (pre-existing; predates this change), needs its own verify-each-render cycle. (3) RC-1 two-column / sidebar interleave.
 ## [2.4.95] — 2026-06-20
 **Flatten now populates `fields` for non-clinical result tables (REQUEST_11).** `TABLE_EXTRACTION_VERSION` → `2.4.0`; no `NORMALIZATION_VERSION` / `SECTIONING_VERSION` change. v2.4.94 solved the clinical PROSECCO table (labelled headers); this closes the two reproducers whose `fields` still came back `{}` — header-stripped result tables and tables packing parallel arms into single cells.
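
The "marginal case" scoping described in the 2.4.96 entry above can be illustrated with a minimal sketch. It assumes a small stand-in function-word set (the real `_FUNCTION_WORDS_IN_PROSE` contents are not shown in this diff) and a hypothetical helper name, and it leaves out the verb-shape and title-leak conditions the real path also applies.

```python
# Hypothetical stand-in; the library's actual function-word set is not
# visible in this diff and is likely larger.
FUNCTION_WORDS = {"in", "of", "to", "and", "for", "with", "that", "we"}


def crosses_bar_only_with_the(th_text: str, bar: int = 3) -> bool:
    """True when a header cell reaches the function-word bar only if "the" counts.

    Cells that already meet the bar without "the" return False here, because
    the pre-existing path handles them and their output must not change.
    """
    words = [w.strip(".,:;()").lower() for w in th_text.split()]
    fn_count = sum(1 for w in words if w in FUNCTION_WORDS)
    the_count = sum(1 for w in words if w == "the")
    return fn_count < bar and fn_count + the_count >= bar


# The maier Table 7 header quoted in the changelog: "in" + "of" give fn=2,
# and "the" lifts it to 3.
print(crosses_bar_only_with_the(
    "Following the analyses conducted in Study 1 of Small"
))
```

Keeping the new condition disjoint from the existing `fn>=3` path is what lets the changelog claim that previously stripped tables stay byte-identical.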

{docpluck-2.4.95 → docpluck-2.4.97}/CLAUDE.md RENAMED Viewed

@@ -50,16 +50,25 @@ docpluck[all] @ git+https://github.com/giladfeldman/docpluck.git@v<VERSION>
 When this library releases a new version, the app's `requirements.txt` git pin must be bumped or production silently keeps running the old library. The `/docpluck-deploy` skill's pre-flight check 4 enforces this.
+### Library ↔ app version sync (HARD RULE — when we bump the package, we bump the app; verify, don't assume)
+**Invariant: the app's docpluck pin and the library version are ALWAYS in sync.** The `@v<VERSION>` pin in `PDFextractor/service/requirements.txt` on **docpluckapp `origin/master`** (what Railway deploys) MUST equal the library's latest released `v*` tag. A lagging pin = production silently runs the old library. Bumping the package therefore *is* bumping the app — the two are never released independently.
+- **Mechanism (best-effort, automated):** `.github/workflows/bump-app-pin.yml` fires on every `v*.*.*` tag push to this repo and auto-commits the pin bump to docpluckapp `master`, which triggers the Railway redeploy. This usually "just works" — but it is **best-effort**: an Actions outage, an expired `APP_REPO_TOKEN`, or a regex drift can let it miss *silently* (it has happened). **Never assume the bump landed.**
+- **Verification (mandatory, deterministic):** run `python scripts/check_app_pin_sync.py` — it reads the pin from docpluckapp `origin/master` (production-authoritative, NOT your local clone) and compares it to the latest library tag. Exit 0 = synced. This gate is wired into `/docpluck-qa`, `/docpluck-review`, and `/docpluck-deploy` (pre-flight check 4); every release MUST pass it.
+- **A stale LOCAL clone lies.** A local `PDFextractor` checkout that hasn't fetched shows an *old* pin even when production is correctly synced — and almost causes a phantom "fix". ALWAYS verify against `origin/master` (the script does this for you); never judge sync from a local working-tree file.
+- **Recovery when it drifted:** re-push the tag (`git push origin v<VERSION>` re-fires the workflow), or hand-bump `service/requirements.txt` to `@v<VERSION>` and push to docpluckapp `master` (triggers Railway redeploy). Then re-run the gate and confirm Railway `/health` reports `docpluck_version == <VERSION>`.
 ## Release flow (library → production)
 1. Make + commit changes in this repo. Bump `__version__` (in `docpluck/__init__.py`), `version` (in `pyproject.toml`), and `NORMALIZATION_VERSION` (in `docpluck/normalize.py`) consistently.
 2. Update `CHANGELOG.md`.
 3. Push to `main`, then tag: `git tag v<VERSION> && git push --tags`.
 4. (Optional) Publish to PyPI: `python -m build && twine upload dist/*`.
-5. In `PDFextractor/service/requirements.txt`, bump the `@v<VERSION>` git pin and update any frozen version examples in `PDFextractor/API.md`.
-6. Run `/docpluck-deploy` from the docpluck repo — pre-flight check 4 verifies the pin matches.
+5. The tag push auto-fires `bump-app-pin.yml`, which bumps the `@v<VERSION>` pin in `PDFextractor/service/requirements.txt` on docpluckapp `master` and triggers the Railway redeploy. **This is best-effort — verify it landed:** run `python scripts/check_app_pin_sync.py` (exit 0 = synced against `origin/master`). If it drifted, recover per "Library ↔ app version sync" above. Also update any frozen version examples in `PDFextractor/API.md`.
+6. Run `/docpluck-deploy` from the docpluck repo — pre-flight check 4 runs the same sync gate and the post-deploy step confirms Railway `/health` reports the new version.
-Skipping step 5 is the most common failure mode. The deploy skill catches it.
+The most common failure mode is assuming the auto-bump landed when it silently missed — step 5's `check_app_pin_sync.py` gate catches it. The deploy skill, qa, and review all run it.
 ## Spike work queue (table-rendering iteration)
@@ -79,6 +88,7 @@ Skipping step 5 is the most common failure mode. The deploy skill catches it.
 - **NEVER call the Anthropic API. ALL Claude model calls go through Claude Max via Claude Code.** Allowed: `Agent` tool in-session (with `model="sonnet"` for the audit subagent); headless `claude -p --model sonnet` from `.git/hooks/*` and `tools/canary_audit.sh`; `mcp__scheduled-tasks__create_scheduled_task` invoking Claude Code. Forbidden: `import anthropic`, `ANTHROPIC_API_KEY` anywhere in this repo or any related repo (`docpluckapp`, `escicheck`, `2Rmarkdown`, `CitationGuard`), `.github/workflows/*` containing Anthropic-API calls. The canary-audit architecture (Sonnet-watches-Opus) is designed around this constraint: external enforcement is local git hooks + scheduled tasks invoking headless Claude Code, NOT GitHub Actions calling the API. Source: user directive 2026-05-25 (memory `feedback_no_apis_only_claude_max`), re-affirming previous statements across multiple sessions. Failure to follow this rule is the same severity as failing "LEAVE NOTHING BEHIND."
 - **LEAVE NOTHING BEHIND.** If you see an issue — any issue, however small, whether pre-existing, already-known, "out of scope", or unrelated to the task at hand — you fix it in the same run. "Pre-existing", "known", "not introduced by this change", and "out of scope" are NEVER grounds to leave a defect in place; noticing a defect and walking past it is itself a defect. Two — and only two — exceptions: **(a)** the fix needs a product or architecture decision only the user can make — surface it explicitly and immediately, never bury it; **(b)** the fix is genuinely too entangled to land in the current change — then it is queued as an *immediate next cycle in the same run*, never as "later", never as a handoff-doc footnote. Never end a task, cycle, or run with a known issue unaddressed. Established by user directive 2026-05-14, re-affirmed 2026-05-15, 2026-05-17, and **2026-05-19** ("doesn't matter pre-existing or not; this directive holds for all future runs, every skill"). This generalizes and strengthens the rule-0e family (memory `feedback_fix_every_bug_found`). See the prominent top-of-file statement under "Working directive — LEAVE NOTHING BEHIND".
+- **KEEP THE APP PIN IN SYNC WITH THE LIBRARY — when we bump the package, we bump the app.** The `@v<VERSION>` docpluck pin in `PDFextractor/service/requirements.txt` (on docpluckapp `origin/master`, the production-authoritative source) MUST always equal the library's latest released `v*` tag; a lagging pin silently runs the old library in production. `bump-app-pin.yml` auto-bumps it on tag push but is **best-effort and has missed silently** — never assume, always verify with `python scripts/check_app_pin_sync.py` (exit 0 = synced). A stale LOCAL clone shows an old pin even when prod is synced, so the gate reads `origin/master`, never a local file. Wired into `/docpluck-qa`, `/docpluck-review`, `/docpluck-deploy`. Full mechanism + recovery under "Two-Repo Architecture → Library ↔ app version sync". Established by user directive 2026-06-20.
 - **EVERY FIX MUST BE GENERAL — serve all future PDFs, never a one-PDF quick-hack.** docpluck is a meta-science tool that processes arbitrary academic PDFs across many publishers. Every change must be keyed on a STRUCTURAL SIGNATURE — a typographic pattern, layout invariant, glyph-corruption shape, section-structure rule — never on paper identity, filename, or a string hard-coded from one PDF. A change that resolves one paper's quirk but risks regressions on others is the WRONG fix; find the general root cause. Regression tests use specific PDF fixtures, but the fix *logic* must generalize to any PDF with the same structural signature. Always run the full 26-paper baseline to confirm no regression; widen verification (broad-read, more AI-golds) when a fix touches a shared code path. Established by user directive 2026-05-15. See memory `feedback_general_fixes_not_pdf_specific`.
 - **NEVER swap the PDF text-extraction tool as a fix for downstream problems.** The TEXT channel is `extract_pdf` (pdftotext default mode); the LAYOUT channel is `extract_pdf_layout` (pdfplumber).  They are not interchangeable text sources.  Sections / normalize / batch consume the text channel; tables / figures / F0-layout-strip consume the layout channel.  Real-world-paper bugs (watermarks in body, abstract not detected, column interleaving) must be fixed in the layer that owns the artifact (`normalize.py` W0, `sections/annotators/text.py`, `sections/taxonomy.py`, `sections/core.py`) — not by switching extraction tools.  See [LESSONS.md L-001](./LESSONS.md#l-001--never-swap-the-pdf-text-extraction-tool-as-a-fix-for-downstream-problems) for the full incident record.
 - **NEVER use pdftotext with `-layout` flag** — causes column interleaving. See `docpluck/extract.py:13–16` and [LESSONS.md L-002](./LESSONS.md#l-002--never-use-pdftotext--layout-flag).
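
The pin-versus-tag comparison behind the sync gate can be sketched as follows. This is not the contents of `scripts/check_app_pin_sync.py`, which this diff does not show; the function names are hypothetical, and the sketch takes the requirements text and tag list as plain arguments rather than reading them from `origin/master`.

```python
import re
from typing import List, Optional

# Matches a git pin such as "...docpluck.git@v2.4.95" (format taken from the
# requirements line quoted in CLAUDE.md).
PIN_RE = re.compile(r"docpluck.*@v(\d+\.\d+\.\d+)")
TAG_RE = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_pin(requirements_text: str) -> Optional[str]:
    """Return the pinned docpluck version, or None if no pin is found."""
    for line in requirements_text.splitlines():
        m = PIN_RE.search(line)
        if m:
            return m.group(1)
    return None


def latest_tag(tags: List[str]) -> Optional[str]:
    """Return the highest vX.Y.Z tag, compared numerically, not as text."""
    versions = []
    for tag in tags:
        m = TAG_RE.fullmatch(tag)
        if m:
            versions.append(tuple(int(part) for part in m.groups()))
    if not versions:
        return None
    return ".".join(str(part) for part in max(versions))


def pin_is_synced(requirements_text: str, tags: List[str]) -> bool:
    """True only when a pin exists and equals the latest released tag."""
    pin = parse_pin(requirements_text)
    return pin is not None and pin == latest_tag(tags)


req = "docpluck[all] @ git+https://github.com/giladfeldman/docpluck.git@v2.4.95\n"
print(pin_is_synced(req, ["v2.4.95", "v2.4.97"]))  # lagging pin
```

Comparing versions as integer tuples matters once a component reaches three digits: a plain string comparison would rank `v2.4.97` above `v2.4.100`.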

{docpluck-2.4.95 → docpluck-2.4.97}/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpluck
-Version: 2.4.95
+Version: 2.4.97
 Summary: PDF, DOCX, and HTML text extraction and normalization for academic papers
 Project-URL: Homepage, https://docpluck.app
 Project-URL: Documentation, https://docpluck.app/api-docs

{docpluck-2.4.95 → docpluck-2.4.97}/TODO.md RENAMED Viewed

@@ -2,6 +2,24 @@
 This file tracks future-aim items that are scoped out of the current milestone but should not be lost. See `docs/superpowers/specs/` for active specs.
+## 2026-06-21 — v2.4.95 corpus assessment (7/7 canary FAIL) — RC-B7 done, RC-T spec'd + handed off, RC-1 next
+> Full real AI-verify of 7 canaries at HEAD v2.4.95 = **7/7 FAIL** on 3 architectural root causes (the canary-audit hook's `AUDIT_DEFERRED→union PASS` had masked it again — `feedback_canary_audit_clobbers_phase5d`). Canonical queue: [`docs/TRIAGE_2026-06-21_head_v2.4.95_assessment.md`](docs/TRIAGE_2026-06-21_head_v2.4.95_assessment.md). Branch `feat/rc-t-table-region-guard` (commits `db7192b`, `927d869`).
+- [x] **RC-B7 deleted-minus glyph — DONE (W0h).** Already implemented as `normalize.recover_dropped_minus_via_layout` (wired render.py:5079→normalize.py:3170, tested `tests/test_dropped_minus_layout_recovery_real_pdf.py`); HEAD recovers 4/5 ar_apa betas. Residuals (`.245` pixel-minus, β→b) are OCR-tier won't-fix (both pdftotext AND pdfplumber agree on the wrong glyph). The cycle-3 ar_apa "FAIL" was a verifier over-flag.
+- [ ] **RC-T table-region prose contamination — SPEC'd, implementation handed off.** Widest defect (all 7 papers): a small table among prose gets a near-full-page region (ip_feldman T10 region 71→331 reaches into the Discussion prose) and the whitespace clusterer turns prose into "rows" → garbage cells, orphan `### Table N`, empty shells, duplicate dumps. Two layers: (1) Camelot free-form (no `table_areas`) → whole-page bbox; (2) `caption + SEARCH_BELOW_PT(250)` region with no table-END detection. Fix = region prose-trim + degenerate-region guard, keyed on **cell content not bbox-size**. Spec: [`docs/superpowers/specs/2026-06-21-rc-t-table-region-prose-contamination.md`](docs/superpowers/specs/2026-06-21-rc-t-table-region-prose-contamination.md). Handoff: see `docs/superpowers/handoffs/2026-06-21-*`.
+- [ ] **RC-1 region-aware columns — after RC-T.** Two-column + PLOS-sidebar interleave (chan_feldman, chandrashekar, plos_med front-matter-before-Abstract). Step-2 band path exists ship-dark; see the existing RC-1 entry below + `docs/superpowers/specs/2026-06-08-rc1-region-aware-column-architecture.md`.
+- [ ] **article-finder: correct the `collabra.77859` gold table-numbering** (gold says "Table 2"; the source PDF caption + docpluck + the consumer all say "Table 3" — gold error, see below).
+## 2026-06-20 — v2.4.95 SHIPPED (REQUEST_11: flatten fields for non-clinical result tables) — deferred follow-ups
+> ✅ **Shipped + verified in prod.** main `370b89c` + tag `v2.4.95` (canary PASS), PyPI published, app pin auto-bumped to `@v2.4.95`, Railway `/health` reports `docpluck_version: 2.4.95`. Closes ESCImate REQUEST_11 (blank-header column-role recovery + packed parallel-arm split + general U+2212/bracket-CI fixes). All 4 acceptance criteria met, both target papers AI-gold-verified. Details: `REPLY_FROM_DOCPLUCK_v2.4.95.md`, `docs/HANDOFF_2026-06-20_request11_flatten_nonclinical_tables.md`, `CHANGELOG.md`.
+- [ ] **`fields.effect_type` — opt-in, only if ESCImate requests it.** REQUEST_11 §2.4 asked for `effect_type` (cohens_d / pearson_r / partial_eta_squared / mean_difference) but called it "not a blocker." Deferred because emitting it would add a key to PROSECCO's 6 rows, conflicting with acceptance #4 (PROSECCO byte-identical). Offered as an opt-in in the reply; implement (grounded in key-present + effect vocab) only if the consumer accepts the PROSECCO field-set change.
+- [x] **(ADJUDICATED 2026-06-21 — docpluck CORRECT, gold mis-numbered) `collabra.77859` caption-number "Table 3" vs gold "Table 2".** Resolved via the source text channel: the PDF's own caption at text-line 866 reads verbatim `Table 3. Study 4: Dish sets` (tables run 1→5 sequentially: L282 T1, L680 T2, **L866 T3**, L869 T4, L1034 T5). docpluck binds "Table 3" exactly matching the source; the consumer also calls it Table 3. **The AI gold mis-numbered it "Table 2"** — gold error, not a docpluck defect. → surface to article-finder for gold correction (logged above).
+- [x] **(no action) `collabra.90203` Table 10 "Joint/No-explicit" r=.59 vs gold .63** — the PDF *text layer* encodes `.59` (pdftotext AND Camelot agree); only the AI-visual gold sees `.63`. Source text-layer corruption, undetectable without OCR (not allowed). Documented for the consumer; nothing docpluck can fix.
+- [x] **(handled) Value-exact real-PDF flatten tests flake under `-n10`** — Camelot extraction is non-deterministic under parallel load; the 4 `*_real_pdf` tests now skip under `PYTEST_XDIST_WORKER` (run serially in canonical QA), matching the `test_benchmark_docx_html.py` convention. Synthetic-grid contract tests cover the logic under `-n10`.
 ## 2026-06-16 — deferred for investigation before code changes
 - [ ] **Investigate `sections=` extraction de-dup (no behavior change yet).** `extract_pdf(..., sections=...)`, `extract_docx(..., sections=...)`, and `extract_html(..., sections=...)` currently do one extraction pass and then call `extract_sections(...)`, which can re-run extraction/annotation internally by design. Before optimizing, document invariants proving parity with direct `extract_sections(...)` outputs, then run corpus/harness verification to confirm zero regressions. No implementation change until those proofs are in place.

{docpluck-2.4.95 → docpluck-2.4.97}/docpluck/__init__.py RENAMED Viewed

@@ -78,7 +78,7 @@ from .figures import Figure
 from .extract_structured import TABLE_EXTRACTION_VERSION, StructuredResult, extract_pdf_structured
 from .render import render_pdf_to_markdown
-__version__ = "2.4.95"
+__version__ = "2.4.97"
 __author__ = "Gilad Feldman"
 __license__ = "MIT"

{docpluck-2.4.95 → docpluck-2.4.97}/docpluck/extract_structured.py RENAMED Viewed

@@ -37,7 +37,7 @@ from .tables.render import cells_to_html
 from .telemetry import record_fallback
-TABLE_EXTRACTION_VERSION = "2.4.0"  # v2.4.0 (REQUEST_11): flatten now populates fields for NON-clinical result tables — (a) blank-header column-role recovery (tables.flatten._recover_blank_roles): assign a stat role to a header-stripped column from its data-token SHAPE (CI brackets, df1/df2 pair, estimate-adjacent-CI, p-with-operator) AND caption/footnote/all-header-rows vocabulary, never bare position; recovers collabra.77859 T5 (t/df/d/CI) + collabra.90203 T8/T9 (F/df/p/BF01/eta²p-as-est/CI). (b) packed parallel-arm split (tables.flatten._detect_packed_arms/_flatten_packed_arms): tables packing k≥2 arms into single cells ("Separate Joint" + space-joined values) emit one typed record per arm (group=arm) — collabra.77859 T3 Separate/Joint, xiao_2021 T7 Regret/Justifiability. (c) new BF01 role; validity guards drop r∉[-1,1] / non-monotone CI / non-int n / p∉[0,1]. (d) GENERAL L-004 fixes: _parse_number + _parse_ci_cell fold U+2212 MINUS (negative t/d/CI bounds in Camelot cells were dropped/sign-lost); _VALUE_GROUP_RE handles bracket-led CI groups. Default render + PROSECCO output byte-identical. # v2.3.0 (Tier-2, REQUEST_10): cross-flavor lattice-augmentation — recover data rows a lattice extraction vertically TRUNCATED by appending the rows a same-page, same-column-count stream table captured below the lattice bbox (camelot_extract._augment_lattice_with_stream_rows), gated on equal-col-count + bbox overlap + extends-below; PLUS numeric/parenthetical continuation merge (cell_cleaning._merge_continuation_rows) rejoining stream's stacked value/parenthetical cells. Fixes PROSECCO Table 2 R2-R6. v2.2.0: EC-T1 docpluck.tables.flatten — per-row FlattenedRow records (sentence + structured fields) for downstream stat-verification consumers (effectcheck/escimate/scimeto) + opt-in inline "rendered as text" block below each <table> via render_pdf_to_markdown(flatten_tables_inline=True). v2.1.5: cell-cleaning recovers CMEX10 extensible-bracket PUA glyphs (U+F8EE-F8FB). 
v2.1.4: cell-cleaning recovers Adobe-Symbol-font PUA glyphs (beta/chi/bullet as U+F0xx). v2.1.3: cell-cleaning recovers '<'-as-backslash glyph corruption. v2.1.2: cell-cleaning recovers descending-CI '2'-for-minus corruption. v2.1.1: cell-cleaning recovers (cid:0) corrupted minus signs + strips math-alphanumeric styling. v2.1.0: cell-cleaning pipeline ported from splice spike (multi-row header detection, continuation merging, leader-dot strip, mash-split, group separators, sig-marker attach)
+TABLE_EXTRACTION_VERSION = "2.4.2"  # v2.4.2 (RC-T Layer-2): _extract_table_body_text now (a) Note-anchor — a table's "Note:" footnote is its last element, so trim body prose bled past it (chan_feldman T1/T3, efendic_2022 T5); and (b) degenerate-prose guard — suppress a raw_text fallback that STARTS mid-sentence with a lowercase multi-letter word AND is majority sentence-shaped prose, so render emits a clean caption-only table instead of an unstructured-table dump duplicating Results/Discussion prose (chan_feldman T9 was a verbatim ## Discussion duplicate). FP-safe (real cells start with header/label/number/single-letter marker, never a wrapped continuation); full-corpus 101-PDF guard-diff only trims+suppresses (grew=0 changed=0). # v2.4.1 (DP-2/DP-5): (DP-2) blank-header role recovery now types the unlabeled p-value (a bare `.XXX` after the test stat, no comparison op) and df (a bare integer/Welch-decimal between the stat and the d[CI] column) columns it previously skipped — collabra.77859 T3 fields gain p+df (tables.flatten._recover_blank_roles Pass 4.5). (DP-5) parallel-arm tables with a TWO-ROW header no longer drop their first data row, and a CENTERED super-header is aligned to its arm block instead of its visual-center column: (a) cell_cleaning._is_header_like_row counts APA value shapes (leading-dot decimal, bracketed CI, operator-prefixed p, N/A) as data via _DATA_VALUE_CELL_RE so a real first data row isn't read as a 3rd header row (collabra.90203 T10 recovered the Identifiable/Explicit-learning correlation); (b) tables.flatten._detect_column_groups re-derives arm boundaries from equal-width blocks of the data region (each must hold one super-label) so a centered super-label folded mid-span no longer swaps arm values (xiao_2021 T4 Original/Replication F) or pushes a stat column into the label region; (c) tables.flatten._classify_column reads a folded super-header cell's role from its sub-part (collabra.90203 T10 CI). 
Full-corpus cached-table flatten diff: no clean-table regression. # v2.4.0 (REQUEST_11): flatten now populates fields for NON-clinical result tables — (a) blank-header column-role recovery (tables.flatten._recover_blank_roles): assign a stat role to a header-stripped column from its data-token SHAPE (CI brackets, df1/df2 pair, estimate-adjacent-CI, p-with-operator) AND caption/footnote/all-header-rows vocabulary, never bare position; recovers collabra.77859 T5 (t/df/d/CI) + collabra.90203 T8/T9 (F/df/p/BF01/eta²p-as-est/CI). (b) packed parallel-arm split (tables.flatten._detect_packed_arms/_flatten_packed_arms): tables packing k≥2 arms into single cells ("Separate Joint" + space-joined values) emit one typed record per arm (group=arm) — collabra.77859 T3 Separate/Joint, xiao_2021 T7 Regret/Justifiability. (c) new BF01 role; validity guards drop r∉[-1,1] / non-monotone CI / non-int n / p∉[0,1]. (d) GENERAL L-004 fixes: _parse_number + _parse_ci_cell fold U+2212 MINUS (negative t/d/CI bounds in Camelot cells were dropped/sign-lost); _VALUE_GROUP_RE handles bracket-led CI groups. Default render + PROSECCO output byte-identical. # v2.3.0 (Tier-2, REQUEST_10): cross-flavor lattice-augmentation — recover data rows a lattice extraction vertically TRUNCATED by appending the rows a same-page, same-column-count stream table captured below the lattice bbox (camelot_extract._augment_lattice_with_stream_rows), gated on equal-col-count + bbox overlap + extends-below; PLUS numeric/parenthetical continuation merge (cell_cleaning._merge_continuation_rows) rejoining stream's stacked value/parenthetical cells. Fixes PROSECCO Table 2 R2-R6. v2.2.0: EC-T1 docpluck.tables.flatten — per-row FlattenedRow records (sentence + structured fields) for downstream stat-verification consumers (effectcheck/escimate/scimeto) + opt-in inline "rendered as text" block below each <table> via render_pdf_to_markdown(flatten_tables_inline=True). 
v2.1.5: cell-cleaning recovers CMEX10 extensible-bracket PUA glyphs (U+F8EE-F8FB). v2.1.4: cell-cleaning recovers Adobe-Symbol-font PUA glyphs (beta/chi/bullet as U+F0xx). v2.1.3: cell-cleaning recovers '<'-as-backslash glyph corruption. v2.1.2: cell-cleaning recovers descending-CI '2'-for-minus corruption. v2.1.1: cell-cleaning recovers (cid:0) corrupted minus signs + strips math-alphanumeric styling. v2.1.0: cell-cleaning pipeline ported from splice spike (multi-row header detection, continuation merging, leader-dot strip, mash-split, group separators, sig-marker attach)
 TableTextMode = Literal["raw", "placeholder"]
@@ -1306,6 +1306,74 @@ def _line_is_body_prose(line: str) -> bool:
     return stopwords_hit >= 4
+def _join_wrapped_lines(lines: list[str]) -> list[str]:
+    """Merge pdftotext-wrapped lines into logical paragraphs.
+    pdftotext linearizes a flowing prose paragraph into several short
+    (~45-60 char) lines; the per-line ``_line_is_body_prose`` gate
+    (len >= 80) cannot see prose in that wrapped form. Joining a line with
+    the next whenever it does not end on sentence-terminal punctuation
+    reconstructs the paragraph so prose can be measured at paragraph scale.
+    """
+    paras: list[str] = []
+    cur = ""
+    for ln in lines:
+        s = ln.strip()
+        if not s:
+            continue
+        cur = (cur + " " + s).strip() if cur else s
+        if s.endswith((".", "!", "?", ":")):
+            paras.append(cur)
+            cur = ""
+    if cur:
+        paras.append(cur)
+    return paras
+def _raw_text_is_degenerate_prose(text: str) -> bool:
+    """True if a table raw_text fallback is dominated by flowing body prose.
+    RC-T Layer-2 (v2.4.97). When Camelot recovers no cells AND the
+    caption-anchored region has no extractable table text near the caption,
+    the body_start walk lands INSIDE a prose paragraph and the fallback
+    swallows Results/Discussion prose (which is then duplicated under its
+    real section heading). Such a block must be suppressed (render then
+    emits a clean caption-only table) rather than dumped verbatim.
+    FP-safe by construction — fires only when BOTH hold:
+      (a) the block STARTS mid-sentence: its first line begins with a
+          lowercase multi-letter continuation word. A real table's
+          linearized cells start with a column header, label, number, or a
+          single-letter item marker (``a``/``b``/``c``) — never a wrapped
+          mid-paragraph continuation like "than empathy. We provided ...".
+      (b) the joined block is majority (>= 60% of chars) sentence-shaped
+          body prose.
+    Legitimate degraded tables are preserved: hypotheses ("a There is a
+    positive association ..."), descriptive rows ("Median age (years)"),
+    instrument items ("h et al., 1997)") all fail (a). Keyed purely on the
+    structural overshoot signature, never on paper identity.
+    """
+    lines = [ln for ln in text.split("\n") if ln.strip()]
+    if len(lines) < 4:
+        return False
+    first_tokens = lines[0].split()
+    first_word = first_tokens[0] if first_tokens else ""
+    starts_midsentence = (
+        len(first_word) >= 2
+        and first_word[0].islower()
+        and first_word[0].isalpha()
+    )
+    if not starts_midsentence:
+        return False
+    paragraphs = _join_wrapped_lines(lines)
+    total = sum(len(p) for p in paragraphs)
+    if total == 0:
+        return False
+    prose = sum(len(p) for p in paragraphs if _line_is_body_prose(p))
+    return prose >= 0.6 * total
 def _extract_table_body_text(
     raw_text: str,
     cap: CaptionMatch,
@@ -1379,6 +1447,31 @@ def _extract_table_body_text(
             break
         kept.append(ln)
+    # Note-anchor table-end (RC-T Layer-2, v2.4.97). A table's "Note:" /
+    # "Notes:" footnote is, by academic-table convention, its LAST element.
+    # Any text after the note paragraph is body prose that bled past the
+    # table boundary — the caption-anchored region overshot the table end
+    # and the per-line `_line_is_body_prose` gate (len >= 80) misses prose
+    # that pdftotext WRAPPED into short (~48-char) lines, so it accumulates
+    # here. Trim everything after the note's (possibly wrapped) paragraph.
+    # This is FP-safe: legitimate table cells (hypotheses a/b/c, instrument
+    # items) appear BEFORE the note; nothing legitimate follows it. Keyed on
+    # the structural "Note: ... <sentence end>" signature, never paper
+    # identity. `^Notes?[.:]` requires punctuation so body prose that merely
+    # starts with the word "Note that ..." does not false-trigger.
+    note_idx = next(
+        (i for i, ln in enumerate(kept)
+         if re.match(r"^\s*Notes?[.:]", ln.strip())),
+        None,
+    )
+    if note_idx is not None and not os.environ.get("DOCPLUCK_RCT_L2_BYPASS"):
+        note_end = note_idx
+        for k in range(note_idx, len(kept)):
+            note_end = k
+            if kept[k].strip().endswith((".", "!", "?")):
+                break
+        kept = kept[: note_end + 1]
     # Trim trailing heading-like short lines that don't belong to this table
     # (the start of the next section). Two patterns are trimmed:
     #   * Title-Case headings without a sentence terminator
@@ -1414,7 +1507,17 @@ def _extract_table_body_text(
         s = re.sub(r"[ \t]+", " ", ln).strip()
         if s:
             cleaned_lines.append(s)
-    return "\n".join(cleaned_lines).strip()
+    result = "\n".join(cleaned_lines).strip()
+    # Degenerate-prose guard (RC-T Layer-2, v2.4.97): drop a raw_text
+    # fallback that is really body prose the region overshot into, so the
+    # renderer emits a clean caption-only table instead of an
+    # ``unstructured-table`` dump that duplicates Results/Discussion prose.
+    # ``DOCPLUCK_RCT_L2_BYPASS`` reverts both Layer-2 additions (Note-anchor
+    # + this guard) to HEAD behavior — used only by the FP-scan harness to
+    # diff guard-live vs guard-bypassed over the full corpus.
+    if not os.environ.get("DOCPLUCK_RCT_L2_BYPASS") and _raw_text_is_degenerate_prose(result):
+        return ""
+    return result
 def _figure_from_caption(
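
The Note-anchor trim added in the hunk above can be exercised in isolation. The sketch below lifts the same regex and loop into a standalone function with a hypothetical name; the sample lines are invented for illustration (only the "than empathy. We provided" fragment echoes the docstring), and the bypass environment variable is omitted.

```python
import re
from typing import List

# Same pattern as the diff: "Note" or "Notes" followed by "." or ":", so a
# body sentence starting "Note that ..." does not trigger.
NOTE_RE = re.compile(r"^\s*Notes?[.:]")


def trim_after_note(lines: List[str]) -> List[str]:
    """Drop everything after the (possibly wrapped) table note paragraph."""
    note_idx = next(
        (i for i, ln in enumerate(lines) if NOTE_RE.match(ln)), None
    )
    if note_idx is None:
        return lines
    note_end = note_idx
    for k in range(note_idx, len(lines)):
        note_end = k
        # The note paragraph ends at the first line closing a sentence.
        if lines[k].strip().endswith((".", "!", "?")):
            break
    return lines[: note_end + 1]


sample = [
    "Variable M SD",
    "Empathy 3.21 0.84",
    "Note. N = 120; values are",
    "means on a 1-5 scale.",
    "than empathy. We provided participants",
    "with a short description.",
]
for line in trim_after_note(sample):
    print(line)
```

One limitation visible in the sketch: the note is considered finished at the first line that ends with sentence punctuation, so a multi-sentence note whose first sentence happens to end exactly at a line break would be cut after that sentence.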

{docpluck-2.4.95 → docpluck-2.4.97}/docpluck/render.py RENAMED Viewed

@@ -2696,6 +2696,20 @@ def _strip_phantom_camelot_tables(text: str) -> str:
     # Process each <table>...</table> block independently.
     pattern = re.compile(r"<table\b[^>]*>.*?</table>\s*", re.DOTALL | re.IGNORECASE)
+    # RC-T (v2.4.96): token sets of the document's italic caption lines (``*…*``).
+    # Used by the scoped "the" body-prose path below to EXCLUDE a <th> that is
+    # the table's own TITLE leaked into the header (a title-leak on a REAL table,
+    # e.g. amp_1 Table 5) rather than absorbed body prose — high caption-token
+    # overlap == title-leak == keep. Built once per render; empty when a paper
+    # has no captions (then the scoped path simply never excludes).
+    _caption_tok_sets = [
+        toks for toks in (
+            set(re.findall(r"[a-z]{2,}", c.lower()))
+            for c in re.findall(r"\*([^*\n]{12,}?)\*", text)
+        )
+        if len(toks) >= 4
+    ]
     def is_phantom(block: str) -> bool:
         # Extract <th> contents.
         th_cells = re.findall(r"<th[^>]*>(.*?)</th>", block, re.DOTALL | re.IGNORECASE)
@@ -2752,6 +2766,31 @@ def _strip_phantom_camelot_tables(text: str) -> str:
                 if fn_count >= 3 and verb_count >= 2:
                     th_section_leak = True
                     break
+                # RC-T (v2.4.96, 2026-06-21): "the" — the single most common
+                # English function word — is absent from _FUNCTION_WORDS_IN_PROSE,
+                # so a body-prose <th> with exactly fn_count==2 + "the" slipped
+                # under the bar: maier_2023 Table 7 "Following the analyses
+                # conducted in Study 1 of Small" (fn=in,of; verb=following,
+                # conducted) and chan_feldman Table 6 "associations between the
+                # six measures of interest: …". This NEW path counts "the" to push
+                # such ths over — but is SCOPED to the marginal case (fn_count<3
+                # without "the", >=3 with it) so every table already stripped at
+                # HEAD via the fn>=3 path stays byte-identical, AND gated on NOT
+                # being a title-leak: amp_1 Table 5 leaks its own caption
+                # ("Improving Scholarly Impact … Practice") into the <th> over a
+                # REAL grid — stripping it would destroy a real table. A title
+                # leak shares most of its tokens with a caption; body prose
+                # shares ~none.
+                the_count = sum(1 for w in words if w == "the")
+                if fn_count < 3 and (fn_count + the_count) >= 3 and verb_count >= 2:
+                    th_toks = set(re.findall(r"[a-z]{2,}", cleaned_th.lower()))
+                    is_title_leak = bool(th_toks) and any(
+                        len(th_toks & cap) / len(th_toks) >= 0.6
+                        for cap in _caption_tok_sets
+                    )
+                    if not is_title_leak:
+                        th_section_leak = True
+                        break
                 word_set = set(words)
                 if any(t.lower() in word_set for t in _PHANTOM_TABLE_BODY_LEAK_TOKENS):
                     if verb_count >= 1:
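
The title-leak gate in the hunk above reduces to a token-overlap ratio, sketched here as a standalone function. The helper names are hypothetical and the caption string is invented for illustration; the tokenizer regex and the 0.6 threshold are taken from the diff.

```python
import re
from typing import List, Set


def token_set(text: str) -> Set[str]:
    """Lowercase alphabetic tokens of two or more letters, as in the diff."""
    return set(re.findall(r"[a-z]{2,}", text.lower()))


def is_title_leak(th_text: str, captions: List[str], threshold: float = 0.6) -> bool:
    """True when most of the header's tokens also appear in some caption."""
    th_toks = token_set(th_text)
    if not th_toks:
        return False
    return any(
        len(th_toks & token_set(cap)) / len(th_toks) >= threshold
        for cap in captions
    )


# Invented caption; a header repeating it is a title leak on a real table.
captions = ["Table 2. Descriptive statistics for all study measures"]
print(is_title_leak("Descriptive statistics for all study measures", captions))
# Body prose absorbed into a header shares almost nothing with the caption.
print(is_title_leak("Following the analyses conducted in Study 1 of Small", captions))
```

The ratio is normalized by the header's token count rather than the caption's, so a short header fully contained in a long caption still counts as a leak.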

{docpluck-2.4.95 → docpluck-2.4.97}/docpluck/tables/cell_cleaning.py RENAMED Viewed

@@ -393,13 +393,32 @@ _NUMERIC_CELL_RE = re.compile(
     r"^[-−–]?\d+(?:[.,]\d+)*(?:[%∗*]+)?(?:\s*\([^)]*\))?$"
 )
+# A cell carrying a statistic VALUE (vs a header label). Broader than
+# _NUMERIC_CELL_RE: also matches APA leading-dot decimals (".34"), operator-
+# prefixed p-values ("< .001"), bracketed numeric intervals ("[0.53, 0.72]"),
+# and the "N/A" filler — all DATA, not header text. The interval branch requires
+# a digit and NO letters inside the brackets so a genuine header cell like
+# "[95% CI]" (letters present) is NOT counted as data and stays a header. Used by
+# `_is_header_like_row` so a real data row whose APA-formatted values the bare
+# numeric pattern under-counted is not mistaken for an extra header row — the
+# bug that silently dropped the FIRST data row of two-header-row correlation
+# tables (collabra.90203 Table 10, DP-5).
+_DATA_VALUE_CELL_RE = re.compile(
+    r"^(?:"
+    r"[<>=]?\s*[-−–]?\d*[.,]?\d+(?:[.,]\d+)*(?:[%∗*]+)?(?:\s*\([^)]*\))?"
+    r"|\[[^\]A-Za-z]*\d[^\]A-Za-z]*\]"
+    r"|n\s*/?\s*a"
+    r")$",
+    re.I,
+)
 def _is_header_like_row(row: list[str]) -> bool:
     """Heuristic: a row that looks like part of a header rather than data."""
     nonempty = [c.strip() for c in row if (c or "").strip()]
     if not nonempty:
         return False
-    numeric = sum(1 for c in nonempty if _NUMERIC_CELL_RE.match(c))
+    numeric = sum(1 for c in nonempty if _DATA_VALUE_CELL_RE.match(c))
     if numeric / len(nonempty) > 0.3:
         return False
     avg_len = sum(len(c) for c in nonempty) / len(nonempty)
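
To see why swapping the pattern changes the header/data decision, the two regexes from this hunk can be compared on a sample row. The patterns are copied from the diff; the `value_share` helper and the rows are invented for illustration.

```python
import re
from typing import List

# Old pattern: requires a leading digit, so APA-style ".59" is not numeric.
NUMERIC_CELL_RE = re.compile(
    r"^[-−–]?\d+(?:[.,]\d+)*(?:[%∗*]+)?(?:\s*\([^)]*\))?$"
)

# New pattern: also accepts leading-dot decimals, operator-prefixed p-values,
# letter-free bracketed intervals, and "N/A".
DATA_VALUE_CELL_RE = re.compile(
    r"^(?:"
    r"[<>=]?\s*[-−–]?\d*[.,]?\d+(?:[.,]\d+)*(?:[%∗*]+)?(?:\s*\([^)]*\))?"
    r"|\[[^\]A-Za-z]*\d[^\]A-Za-z]*\]"
    r"|n\s*/?\s*a"
    r")$",
    re.I,
)


def value_share(row: List[str], pattern: "re.Pattern[str]") -> float:
    """Fraction of non-empty cells that the pattern treats as data values."""
    nonempty = [c.strip() for c in row if (c or "").strip()]
    if not nonempty:
        return 0.0
    return sum(1 for c in nonempty if pattern.match(c)) / len(nonempty)


data_row = ["Joint", ".59", "< .001", "[.41, .72]"]
header_row = ["Condition", "r", "p", "[95% CI]"]

print(value_share(data_row, NUMERIC_CELL_RE))     # 0.0
print(value_share(data_row, DATA_VALUE_CELL_RE))  # 0.75
print(value_share(header_row, DATA_VALUE_CELL_RE))  # 0.0
```

With the old pattern the data row's share stays below the 0.3 cutoff in `_is_header_like_row`, so the row can still be classified as a header; with the new pattern it exceeds the cutoff and is kept as data, while the `[95% CI]` header cell remains unmatched because it contains letters.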
