PyPI - docpluck - Versions diffs - 2.4.96__tar.gz → 2.4.97__tar.gz - Mend

docpluck-2.4.96.tar.gz → docpluck-2.4.97.tar.gz


Files changed (436)

{docpluck-2.4.96 → docpluck-2.4.97}/CHANGELOG.md RENAMED

@@ -1,5 +1,19 @@
 # Changelog
+## [2.4.97] — 2026-06-22
+**Three table fixes shipped together (combined from two concurrent sessions): type the skipped p+df columns (DP-2), stop dropping / mis-binding two-header-row tables (DP-5), and stop the table raw_text fallback swallowing body prose (RC-T Layer-2).** `TABLE_EXTRACTION_VERSION` → `2.4.2`; no `NORMALIZATION_VERSION` / `SECTIONING_VERSION` change. DP-2/DP-5 are render-visible in the inline flattened-table blocks + the `.tables.jsonl` sidecar `fields` (the `<table>` HTML gains the previously-dropped data rows); RC-T Layer-2 is render-visible in the `unstructured-table` fallback blocks. DP-2/DP-5 filed in `ESCIcheckapp/docs/DOCPLUCK_HANDOFF_2026-06-21.md`; RC-T Layer-2 per `docs/superpowers/specs/2026-06-21-rc-t-table-region-prose-contamination.md`.
+- **DP-2 — type the unlabeled p and df columns.** `tables.flatten._recover_blank_roles` recovered the leading test statistic and the `d [CI]` column of a header-stripped result table but left the bare p-value and df columns between them untyped, so `collabra.77859` Table 3 emitted `fields: {group, t, d, CI}` and dropped the `p` (`.551`) and `df` (`260.54`). A new Pass 4.5 types a still-blank column that is a bare `.XXX` with no comparison op as `p`, and a bare integer / Welch-decimal sitting between the test statistic and its `est/CI` column as `df` — keyed on data shape + position relative to the already-recovered roles, never bare position. The four Table-3 rows now carry `p` and `df`.
+- **DP-5 — two-row-header parallel-arm tables: recover the first data row and align centered super-headers.** `collabra.90203` Table 10 delivered only 5 of its 6 correlation rows (the Identifiable/Explicit-learning row was silently dropped), and the Original/Replication arms of `xiao_2021` Table 4 were swapped. Three coupled root-cause fixes: (a) `cell_cleaning._is_header_like_row` now counts APA value shapes (leading-dot decimal, bracketed CI, operator-prefixed p, `N/A`) as data via `_DATA_VALUE_CELL_RE`, so a real first data row is no longer mis-read as a third header row (the bracket branch requires a digit and no letters inside, so a genuine `[95% CI]` header stays a header); (b) `tables.flatten._detect_column_groups` re-derives arm boundaries from equal-width blocks of the data region — each must contain exactly one super-label — so a *centered* super-label (camelot stream loses colspan and folds it mid-span) no longer swaps arm values or pushes a stat column into the label region; left-aligned super-headers stay byte-identical; (c) `tables.flatten._classify_column` reads a folded super-header cell's role from its sub-part so a folded `…<sep>95% CI` column is still typed `CI`. Table 10 now emits all 6 conditions split into Target-article / Replication arms with correct `r` / `n` / `CI` / `p`; xiao Table 4 arms are no longer swapped; incidentally recovers `chan_feldman` Table 8 arm labels and `jama_open_2` Table 3 HR estimates + CIs.
+- **RC-T Layer-2 — stop the table raw_text fallback swallowing body prose.** When Camelot recovers no cells, `extract_structured._extract_table_body_text` linearises the text after a caption as the `unstructured-table` fallback; its per-line prose gate (`_line_is_body_prose`, len ≥ 80) misses prose that pdftotext WRAPPED into short (~48-char) lines, so a short table's caption-anchored region overshot the table end and swallowed Results/Discussion prose. Two FP-safe structural fixes: **(a) Note-anchor** — a table's `Note:` footnote is, by convention, its last element, so trim everything after the note paragraph (`chan_feldman` T1/T3 + `efendic_2022` T5 trailing Discussion prose removed; the stat rows + the note are kept); **(b) degenerate-prose guard** — suppress a fallback block that STARTS mid-sentence with a lowercase multi-letter word AND is majority sentence-shaped prose, so the renderer emits a clean caption-only table (`chan_feldman` T9 was an entire verbatim duplicate of `## Discussion` — now caption-only, no duplication). FP-safe by construction: real table cells start with a header / label / number / single-letter item marker, never a wrapped mid-sentence continuation — hypotheses ("a There is a positive association…"), descriptive rows ("Median age"), and instrument fragments are preserved. Keyed on the structural overshoot signature, never paper identity.
+Verification: new real-PDF + contract regression tests (`tests/test_tables_superheader_alignment_real_pdf.py`) — collabra.90203 T10 six-conditions/correct-arms + xiao T4 not-swapped (each FAILS at HEAD, PASSES after), plus `_is_header_like_row` / `_detect_column_groups` contract cases; `tests/test_tables_flatten_blank_header_recovery.py` extended for DP-2. A full-corpus (101-PDF) cached-table flatten diff confirms no clean-table regression — every changed table is a recovered row, a correct arm split, a recovered field, or a removed stat-less spurious row; already-garbage tables shuffle without a clean table regressing. Broad pytest green (real-PDF Camelot tests run serially per file — non-deterministic under cumulative load). RC-T Layer-2 adds `tests/test_rc_t_layer2_raw_text_real_pdf.py` (6 contract + 4 real-PDF: chan T1 Note-anchor, T9 suppress-no-duplication, T3 preserved) and an independent full-corpus 101-PDF guard-live-vs-bypassed raw_text diff (`grew=0 changed=0`; 4 trims + 8 prose-suppressions only). A 7-canary Sonnet AI-gold verify confirms every table this release touched is correct (chan T1/T3/T9, maier T10 six-conditions, xiao T4 arms) with no new TEXT-LOSS / HALLUCINATION.
+**Deferred (pre-existing, user decision 2026-06-22):** the remaining canary AI-verify FAILs are the architectural backlog, NOT regressions from this release — RC-T **Layer-1** table-data recovery (`table_areas`; e.g. plos_med Table 5's SAE rows, chan_feldman / chandrashekar under-extraction) and RC-1 two-column / sidebar column-interleave. Tracked in `docs/TRIAGE_2026-06-21_head_v2.4.95_assessment.md`; intentionally not addressed here.
 ## [2.4.96] — 2026-06-21
 **RC-T (Option A): strip Camelot "tables" that are absorbed body prose, not data.** Render-only — `render.py::_strip_phantom_camelot_tables`; no `TABLE_EXTRACTION_VERSION` / `NORMALIZATION_VERSION` / `SECTIONING_VERSION` change.
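
The DP-2 rule in the 2.4.97 entry (type a still-blank column as `p` or `df` from the shape of its values) can be illustrated with a small standalone sketch. Only the sub-one pattern is copied from the `flatten.py` hunk in this diff. `type_blank_column` and `BARE_NUM_RE` are hypothetical stand-ins (the package's `_BARE_NUM_RE` and `_frac_match` are not shown here), `all(...)` replaces whatever match fraction the package applies, and the positional gate (between the statistic and the interval column) is left out.

```python
import re
from typing import Optional

# Copied from the flatten.py hunk: a value in [0, 1), optionally with an operator.
SUB_ONE_DEC_RE = re.compile(r"^\s*[<>=]?\s*0?\.\d+\s*$")
# Assumption: stand-in for the package's _BARE_NUM_RE, which this diff does not show.
BARE_NUM_RE = re.compile(r"^\s*\d+(?:\.\d+)?\s*$")


def type_blank_column(values: list[str], family: str = "t") -> Optional[str]:
    """Hypothetical helper: guess the role of an unlabeled numeric column."""
    cells = [v for v in values if v.strip()]
    if not cells:
        return None
    # Sub-one decimals (".551", "0.03", "<.001") read as p-values.
    if all(SUB_ONE_DEC_RE.match(v) for v in cells):
        return "p"
    # Bare numbers >= 1 (integer "131", Welch "260.54") read as df, or n for r tables.
    if all(BARE_NUM_RE.match(v) and float(v) >= 1 for v in cells):
        return "n" if family == "r" else "df"
    return None


p_role = type_blank_column([".551", "<.001"])
df_role = type_blank_column(["260.54", "131"])
n_role = type_blank_column(["170", "131"], family="r")
```

Shape alone cannot tell a df from a test statistic such as `1.31`, which is why the real pass also requires the column to sit between the already-typed statistic and the interval column.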

{docpluck-2.4.96 → docpluck-2.4.97}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpluck
-Version: 2.4.96
+Version: 2.4.97
 Summary: PDF, DOCX, and HTML text extraction and normalization for academic papers
 Project-URL: Homepage, https://docpluck.app
 Project-URL: Documentation, https://docpluck.app/api-docs

{docpluck-2.4.96 → docpluck-2.4.97}/docpluck/__init__.py RENAMED

@@ -78,7 +78,7 @@ from .figures import Figure
 from .extract_structured import TABLE_EXTRACTION_VERSION, StructuredResult, extract_pdf_structured
 from .render import render_pdf_to_markdown
-__version__ = "2.4.96"
+__version__ = "2.4.97"
 __author__ = "Gilad Feldman"
 __license__ = "MIT"

{docpluck-2.4.96 → docpluck-2.4.97}/docpluck/extract_structured.py RENAMED

@@ -37,7 +37,7 @@ from .tables.render import cells_to_html
 from .telemetry import record_fallback
-TABLE_EXTRACTION_VERSION = "2.4.0"  # v2.4.0 (REQUEST_11): flatten now populates fields for NON-clinical result tables — (a) blank-header column-role recovery (tables.flatten._recover_blank_roles): assign a stat role to a header-stripped column from its data-token SHAPE (CI brackets, df1/df2 pair, estimate-adjacent-CI, p-with-operator) AND caption/footnote/all-header-rows vocabulary, never bare position; recovers collabra.77859 T5 (t/df/d/CI) + collabra.90203 T8/T9 (F/df/p/BF01/eta²p-as-est/CI). (b) packed parallel-arm split (tables.flatten._detect_packed_arms/_flatten_packed_arms): tables packing k≥2 arms into single cells ("Separate Joint" + space-joined values) emit one typed record per arm (group=arm) — collabra.77859 T3 Separate/Joint, xiao_2021 T7 Regret/Justifiability. (c) new BF01 role; validity guards drop r∉[-1,1] / non-monotone CI / non-int n / p∉[0,1]. (d) GENERAL L-004 fixes: _parse_number + _parse_ci_cell fold U+2212 MINUS (negative t/d/CI bounds in Camelot cells were dropped/sign-lost); _VALUE_GROUP_RE handles bracket-led CI groups. Default render + PROSECCO output byte-identical. # v2.3.0 (Tier-2, REQUEST_10): cross-flavor lattice-augmentation — recover data rows a lattice extraction vertically TRUNCATED by appending the rows a same-page, same-column-count stream table captured below the lattice bbox (camelot_extract._augment_lattice_with_stream_rows), gated on equal-col-count + bbox overlap + extends-below; PLUS numeric/parenthetical continuation merge (cell_cleaning._merge_continuation_rows) rejoining stream's stacked value/parenthetical cells. Fixes PROSECCO Table 2 R2-R6. v2.2.0: EC-T1 docpluck.tables.flatten — per-row FlattenedRow records (sentence + structured fields) for downstream stat-verification consumers (effectcheck/escimate/scimeto) + opt-in inline "rendered as text" block below each <table> via render_pdf_to_markdown(flatten_tables_inline=True). v2.1.5: cell-cleaning recovers CMEX10 extensible-bracket PUA glyphs (U+F8EE-F8FB). 
v2.1.4: cell-cleaning recovers Adobe-Symbol-font PUA glyphs (beta/chi/bullet as U+F0xx). v2.1.3: cell-cleaning recovers '<'-as-backslash glyph corruption. v2.1.2: cell-cleaning recovers descending-CI '2'-for-minus corruption. v2.1.1: cell-cleaning recovers (cid:0) corrupted minus signs + strips math-alphanumeric styling. v2.1.0: cell-cleaning pipeline ported from splice spike (multi-row header detection, continuation merging, leader-dot strip, mash-split, group separators, sig-marker attach)
+TABLE_EXTRACTION_VERSION = "2.4.2"  # v2.4.2 (RC-T Layer-2): _extract_table_body_text now (a) Note-anchor — a table's "Note:" footnote is its last element, so trim body prose bled past it (chan_feldman T1/T3, efendic_2022 T5); and (b) degenerate-prose guard — suppress a raw_text fallback that STARTS mid-sentence with a lowercase multi-letter word AND is majority sentence-shaped prose, so render emits a clean caption-only table instead of an unstructured-table dump duplicating Results/Discussion prose (chan_feldman T9 was a verbatim ## Discussion duplicate). FP-safe (real cells start with header/label/number/single-letter marker, never a wrapped continuation); full-corpus 101-PDF guard-diff only trims+suppresses (grew=0 changed=0). # v2.4.1 (DP-2/DP-5): (DP-2) blank-header role recovery now types the unlabeled p-value (a bare `.XXX` after the test stat, no comparison op) and df (a bare integer/Welch-decimal between the stat and the d[CI] column) columns it previously skipped — collabra.77859 T3 fields gain p+df (tables.flatten._recover_blank_roles Pass 4.5). (DP-5) parallel-arm tables with a TWO-ROW header no longer drop their first data row, and a CENTERED super-header is aligned to its arm block instead of its visual-center column: (a) cell_cleaning._is_header_like_row counts APA value shapes (leading-dot decimal, bracketed CI, operator-prefixed p, N/A) as data via _DATA_VALUE_CELL_RE so a real first data row isn't read as a 3rd header row (collabra.90203 T10 recovered the Identifiable/Explicit-learning correlation); (b) tables.flatten._detect_column_groups re-derives arm boundaries from equal-width blocks of the data region (each must hold one super-label) so a centered super-label folded mid-span no longer swaps arm values (xiao_2021 T4 Original/Replication F) or pushes a stat column into the label region; (c) tables.flatten._classify_column reads a folded super-header cell's role from its sub-part (collabra.90203 T10 CI). 
Full-corpus cached-table flatten diff: no clean-table regression. # v2.4.0 (REQUEST_11): flatten now populates fields for NON-clinical result tables — (a) blank-header column-role recovery (tables.flatten._recover_blank_roles): assign a stat role to a header-stripped column from its data-token SHAPE (CI brackets, df1/df2 pair, estimate-adjacent-CI, p-with-operator) AND caption/footnote/all-header-rows vocabulary, never bare position; recovers collabra.77859 T5 (t/df/d/CI) + collabra.90203 T8/T9 (F/df/p/BF01/eta²p-as-est/CI). (b) packed parallel-arm split (tables.flatten._detect_packed_arms/_flatten_packed_arms): tables packing k≥2 arms into single cells ("Separate Joint" + space-joined values) emit one typed record per arm (group=arm) — collabra.77859 T3 Separate/Joint, xiao_2021 T7 Regret/Justifiability. (c) new BF01 role; validity guards drop r∉[-1,1] / non-monotone CI / non-int n / p∉[0,1]. (d) GENERAL L-004 fixes: _parse_number + _parse_ci_cell fold U+2212 MINUS (negative t/d/CI bounds in Camelot cells were dropped/sign-lost); _VALUE_GROUP_RE handles bracket-led CI groups. Default render + PROSECCO output byte-identical. # v2.3.0 (Tier-2, REQUEST_10): cross-flavor lattice-augmentation — recover data rows a lattice extraction vertically TRUNCATED by appending the rows a same-page, same-column-count stream table captured below the lattice bbox (camelot_extract._augment_lattice_with_stream_rows), gated on equal-col-count + bbox overlap + extends-below; PLUS numeric/parenthetical continuation merge (cell_cleaning._merge_continuation_rows) rejoining stream's stacked value/parenthetical cells. Fixes PROSECCO Table 2 R2-R6. v2.2.0: EC-T1 docpluck.tables.flatten — per-row FlattenedRow records (sentence + structured fields) for downstream stat-verification consumers (effectcheck/escimate/scimeto) + opt-in inline "rendered as text" block below each <table> via render_pdf_to_markdown(flatten_tables_inline=True). 
v2.1.5: cell-cleaning recovers CMEX10 extensible-bracket PUA glyphs (U+F8EE-F8FB). v2.1.4: cell-cleaning recovers Adobe-Symbol-font PUA glyphs (beta/chi/bullet as U+F0xx). v2.1.3: cell-cleaning recovers '<'-as-backslash glyph corruption. v2.1.2: cell-cleaning recovers descending-CI '2'-for-minus corruption. v2.1.1: cell-cleaning recovers (cid:0) corrupted minus signs + strips math-alphanumeric styling. v2.1.0: cell-cleaning pipeline ported from splice spike (multi-row header detection, continuation merging, leader-dot strip, mash-split, group separators, sig-marker attach)
 TableTextMode = Literal["raw", "placeholder"]
@@ -1306,6 +1306,74 @@ def _line_is_body_prose(line: str) -> bool:
     return stopwords_hit >= 4
+def _join_wrapped_lines(lines: list[str]) -> list[str]:
+    """Merge pdftotext-wrapped lines into logical paragraphs.
+    pdftotext linearizes a flowing prose paragraph into several short
+    (~45-60 char) lines; the per-line ``_line_is_body_prose`` gate
+    (len >= 80) cannot see prose in that wrapped form. Joining a line with
+    the next whenever it does not end on sentence-terminal punctuation
+    reconstructs the paragraph so prose can be measured at paragraph scale.
+    """
+    paras: list[str] = []
+    cur = ""
+    for ln in lines:
+        s = ln.strip()
+        if not s:
+            continue
+        cur = (cur + " " + s).strip() if cur else s
+        if s.endswith((".", "!", "?", ":")):
+            paras.append(cur)
+            cur = ""
+    if cur:
+        paras.append(cur)
+    return paras
+def _raw_text_is_degenerate_prose(text: str) -> bool:
+    """True if a table raw_text fallback is dominated by flowing body prose.
+    RC-T Layer-2 (v2.4.97). When Camelot recovers no cells AND the
+    caption-anchored region has no extractable table text near the caption,
+    the body_start walk lands INSIDE a prose paragraph and the fallback
+    swallows Results/Discussion prose (which is then duplicated under its
+    real section heading). Such a block must be suppressed (render then
+    emits a clean caption-only table) rather than dumped verbatim.
+    FP-safe by construction — fires only when BOTH hold:
+      (a) the block STARTS mid-sentence: its first line begins with a
+          lowercase multi-letter continuation word. A real table's
+          linearized cells start with a column header, label, number, or a
+          single-letter item marker (``a``/``b``/``c``) — never a wrapped
+          mid-paragraph continuation like "than empathy. We provided ...".
+      (b) the joined block is majority (>= 60% of chars) sentence-shaped
+          body prose.
+    Legitimate degraded tables are preserved: hypotheses ("a There is a
+    positive association ..."), descriptive rows ("Median age (years)"),
+    instrument items ("h et al., 1997)") all fail (a). Keyed purely on the
+    structural overshoot signature, never on paper identity.
+    """
+    lines = [ln for ln in text.split("\n") if ln.strip()]
+    if len(lines) < 4:
+        return False
+    first_tokens = lines[0].split()
+    first_word = first_tokens[0] if first_tokens else ""
+    starts_midsentence = (
+        len(first_word) >= 2
+        and first_word[0].islower()
+        and first_word[0].isalpha()
+    )
+    if not starts_midsentence:
+        return False
+    paragraphs = _join_wrapped_lines(lines)
+    total = sum(len(p) for p in paragraphs)
+    if total == 0:
+        return False
+    prose = sum(len(p) for p in paragraphs if _line_is_body_prose(p))
+    return prose >= 0.6 * total
 def _extract_table_body_text(
     raw_text: str,
     cap: CaptionMatch,
@@ -1379,6 +1447,31 @@ def _extract_table_body_text(
             break
         kept.append(ln)
+    # Note-anchor table-end (RC-T Layer-2, v2.4.97). A table's "Note:" /
+    # "Notes:" footnote is, by academic-table convention, its LAST element.
+    # Any text after the note paragraph is body prose that bled past the
+    # table boundary — the caption-anchored region overshot the table end
+    # and the per-line `_line_is_body_prose` gate (len >= 80) misses prose
+    # that pdftotext WRAPPED into short (~48-char) lines, so it accumulates
+    # here. Trim everything after the note's (possibly wrapped) paragraph.
+    # This is FP-safe: legitimate table cells (hypotheses a/b/c, instrument
+    # items) appear BEFORE the note; nothing legitimate follows it. Keyed on
+    # the structural "Note: ... <sentence end>" signature, never paper
+    # identity. `^Notes?[.:]` requires punctuation so body prose that merely
+    # starts with the word "Note that ..." does not false-trigger.
+    note_idx = next(
+        (i for i, ln in enumerate(kept)
+         if re.match(r"^\s*Notes?[.:]", ln.strip())),
+        None,
+    )
+    if note_idx is not None and not os.environ.get("DOCPLUCK_RCT_L2_BYPASS"):
+        note_end = note_idx
+        for k in range(note_idx, len(kept)):
+            note_end = k
+            if kept[k].strip().endswith((".", "!", "?")):
+                break
+        kept = kept[: note_end + 1]
     # Trim trailing heading-like short lines that don't belong to this table
     # (the start of the next section). Two patterns are trimmed:
     #   * Title-Case headings without a sentence terminator
@@ -1414,7 +1507,17 @@ def _extract_table_body_text(
         s = re.sub(r"[ \t]+", " ", ln).strip()
         if s:
             cleaned_lines.append(s)
-    return "\n".join(cleaned_lines).strip()
+    result = "\n".join(cleaned_lines).strip()
+    # Degenerate-prose guard (RC-T Layer-2, v2.4.97): drop a raw_text
+    # fallback that is really body prose the region overshot into, so the
+    # renderer emits a clean caption-only table instead of an
+    # ``unstructured-table`` dump that duplicates Results/Discussion prose.
+    # ``DOCPLUCK_RCT_L2_BYPASS`` reverts both Layer-2 additions (Note-anchor
+    # + this guard) to HEAD behavior — used only by the FP-scan harness to
+    # diff guard-live vs guard-bypassed over the full corpus.
+    if not os.environ.get("DOCPLUCK_RCT_L2_BYPASS") and _raw_text_is_degenerate_prose(result):
+        return ""
+    return result
 def _figure_from_caption(
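
The Note-anchor trim in the second hunk can be read in isolation as the following sketch. The logic mirrors the added lines (first `Note:`/`Notes:` line, then forward to the first line that ends a sentence), but `trim_after_note` is a hypothetical standalone name, the `DOCPLUCK_RCT_L2_BYPASS` switch is omitted, and the sample lines are invented for illustration.

```python
import re

# Same pattern as the hunk: "Note"/"Notes" must be followed by "." or ":".
NOTE_RE = re.compile(r"^\s*Notes?[.:]")


def trim_after_note(lines: list[str]) -> list[str]:
    """Hypothetical helper: drop everything after the table's Note paragraph."""
    note_idx = next(
        (i for i, ln in enumerate(lines) if NOTE_RE.match(ln.strip())),
        None,
    )
    if note_idx is None:
        return lines
    note_end = note_idx
    # Walk forward until the (possibly wrapped) note paragraph ends a sentence.
    for k in range(note_idx, len(lines)):
        note_end = k
        if lines[k].strip().endswith((".", "!", "?")):
            break
    return lines[: note_end + 1]


# Invented sample: two stat rows, a wrapped note, then overshoot prose.
sample = [
    "Condition M SD",
    "Control 3.10 1.20",
    "Note: Higher scores indicate",
    "stronger agreement.",
    "than empathy. We provided participants",
    "with a short description of the task.",
]
trimmed = trim_after_note(sample)
```

Requiring punctuation after `Note` means a prose line that starts with "Note that" leaves the block untouched.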

{docpluck-2.4.96 → docpluck-2.4.97}/docpluck/tables/cell_cleaning.py RENAMED

@@ -393,13 +393,32 @@ _NUMERIC_CELL_RE = re.compile(
     r"^[-−–]?\d+(?:[.,]\d+)*(?:[%∗*]+)?(?:\s*\([^)]*\))?$"
 )
+# A cell carrying a statistic VALUE (vs a header label). Broader than
+# _NUMERIC_CELL_RE: also matches APA leading-dot decimals (".34"), operator-
+# prefixed p-values ("< .001"), bracketed numeric intervals ("[0.53, 0.72]"),
+# and the "N/A" filler — all DATA, not header text. The interval branch requires
+# a digit and NO letters inside the brackets so a genuine header cell like
+# "[95% CI]" (letters present) is NOT counted as data and stays a header. Used by
+# `_is_header_like_row` so a real data row whose APA-formatted values the bare
+# numeric pattern under-counted is not mistaken for an extra header row — the
+# bug that silently dropped the FIRST data row of two-header-row correlation
+# tables (collabra.90203 Table 10, DP-5).
+_DATA_VALUE_CELL_RE = re.compile(
+    r"^(?:"
+    r"[<>=]?\s*[-−–]?\d*[.,]?\d+(?:[.,]\d+)*(?:[%∗*]+)?(?:\s*\([^)]*\))?"
+    r"|\[[^\]A-Za-z]*\d[^\]A-Za-z]*\]"
+    r"|n\s*/?\s*a"
+    r")$",
+    re.I,
+)
 def _is_header_like_row(row: list[str]) -> bool:
     """Heuristic: a row that looks like part of a header rather than data."""
     nonempty = [c.strip() for c in row if (c or "").strip()]
     if not nonempty:
         return False
-    numeric = sum(1 for c in nonempty if _NUMERIC_CELL_RE.match(c))
+    numeric = sum(1 for c in nonempty if _DATA_VALUE_CELL_RE.match(c))
     if numeric / len(nonempty) > 0.3:
         return False
     avg_len = sum(len(c) for c in nonempty) / len(nonempty)
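
To see why swapping the pattern changes the header decision, the sketch below counts value-shaped cells in one row under both regexes. The two patterns are copied from this hunk. `value_cell_count` is a hypothetical helper, and the row is an illustrative APA-style correlation row rather than a cell-for-cell copy of any table.

```python
import re

# Old pattern (context lines of the hunk): needs a leading digit.
NUMERIC_CELL_RE = re.compile(
    r"^[-−–]?\d+(?:[.,]\d+)*(?:[%∗*]+)?(?:\s*\([^)]*\))?$"
)
# New pattern (added lines): also accepts ".34", "< .001", "[0.53, 0.72]", "N/A".
DATA_VALUE_CELL_RE = re.compile(
    r"^(?:"
    r"[<>=]?\s*[-−–]?\d*[.,]?\d+(?:[.,]\d+)*(?:[%∗*]+)?(?:\s*\([^)]*\))?"
    r"|\[[^\]A-Za-z]*\d[^\]A-Za-z]*\]"
    r"|n\s*/?\s*a"
    r")$",
    re.I,
)


def value_cell_count(row: list[str], pattern: re.Pattern) -> int:
    """Hypothetical helper: how many non-empty cells look like data values."""
    return sum(1 for c in row if c.strip() and pattern.match(c.strip()))


# Illustrative row: label, r, n, CI, p.
row = ["Identifiable", ".63", "170", "[0.53, 0.72]", "< .001"]
old_hits = value_cell_count(row, NUMERIC_CELL_RE)
new_hits = value_cell_count(row, DATA_VALUE_CELL_RE)
```

With the old pattern only `170` counts, so the value share (1 of 5) stays at or below the 0.3 threshold and the row falls through to the remaining header heuristics. With the new pattern four of five cells count and the row is rejected as a header immediately.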

{docpluck-2.4.96 → docpluck-2.4.97}/docpluck/tables/flatten.py RENAMED

@@ -261,6 +261,17 @@ def _classify_column(header: str) -> Optional[str]:
     h = (header or "").strip()
     if not h:
         return None
+    # A folded super-header cell ("Replication\x00BR\x0095% CI") carries the GROUP
+    # label in the super-part and the column's OWN role in the sub-part. Classify
+    # on the sub-part first (then the super-part) so a folded CI / p / stat column
+    # is still recognized — otherwise the whole "Replication…95% CI" string never
+    # matches and the column's role is lost (collabra.90203 T10 CI, DP-5).
+    if _MERGE_SEPARATOR in h:
+        for part in reversed([p.strip() for p in h.split(_MERGE_SEPARATOR) if p.strip()]):
+            role = _classify_column(part)
+            if role:
+                return role
+        return None
     # Strip a single trailing punct (`,`, `:`, `.`) that some PDFs include.
     h = h.rstrip(",:.")
     for role, pat in _ROLE_PATTERNS:
@@ -477,6 +488,18 @@ def _looks_like_p(v: str) -> bool:
     return bool(_P_SHAPE_RE.match(v or "") or _NA_RE.match(v or ""))
+# A sub-one decimal: a value in [0, 1) written APA-style (".551", "0.03") with an
+# optional comparison op ("<.001"). Unlike `_P_SHAPE_RE` it rejects an integer
+# part ≥ 1 (so a test statistic like "1.31" is NOT mistaken for a p-value). Used
+# to separate a still-blank p column (sub-one) from a still-blank df / n column
+# (values ≥ 1) when both are unlabeled and adjacent. (Pass 4.5, DP-2.)
+_SUB_ONE_DEC_RE = re.compile(r"^\s*[<>=]?\s*0?\.\d+\s*$")
+def _looks_like_sub_one(v: str) -> bool:
+    return bool(_SUB_ONE_DEC_RE.match(v or ""))
 def _has_comparison_op(v: str) -> bool:
     return "<" in (v or "") or ">" in (v or "")
@@ -800,6 +823,54 @@ def _recover_blank_roles(
             if _frac_match(vals, _is_num_or_na) and (ci + 1) in ci_cols:
                 override[ci] = "est"
+    # Pass 4.5 — p / df (or n) recovery for an established t/F/r results table.
+    # Once the statistic column is typed (Pass 3 / grid) and the table carries a
+    # recognized effect/CI column, the still-blank bare-numeric columns BETWEEN
+    # the statistic and that interval are the p-value and the df/n: p is a sub-one
+    # decimal (".551", "0.03", "<.001") with no integer part; df/n is a bare
+    # number ≥ 1 (Welch "260.54", integer "131"). This types the operator-less p
+    # that Pass 1 defers and the mixed-integer/decimal df that Pass 2 (all-integer
+    # only) skips — both unambiguous HERE because position (after the statistic,
+    # before the interval) pins them. Runs AFTER the caption-run pass so a leaked
+    # header always wins, and BEFORE Pass 5 so a real df is not stolen as an
+    # est-adjacent point estimate. Keyed on structure, never paper identity.
+    # (DP-2: collabra.77859 Separate/Joint t-tests dropped p + Welch df.)
+    if family in ("t", "F", "r"):
+        stat_col = next(
+            (ci for ci in cols if (override.get(ci) or grid_role[ci]) == family),
+            None,
+        )
+        if stat_col is not None:
+            right_bound = min(
+                (
+                    ci
+                    for ci in cols
+                    if ci > stat_col
+                    and grid_role[ci] in ("est_ci", "CI", "CI_lo", "CI_hi", "est")
+                ),
+                default=n,
+            )
+            present_roles = {grid_role[ci] for ci in cols if grid_role[ci]} | set(
+                override.values()
+            )
+            has_p = "p" in present_roles
+            for ci in cols:
+                if grid_role[ci] or ci in override:
+                    continue
+                if not (stat_col < ci < right_bound):
+                    continue
+                vals = _column_values(body, ci)
+                if not _frac_match(vals, _is_num_or_na):
+                    continue
+                if not has_p and _frac_match(vals, _looks_like_sub_one):
+                    override[ci] = "p"
+                    has_p = True
+                elif _frac_match(
+                    vals,
+                    lambda v: bool(_BARE_NUM_RE.match(v)) and (_parse_number(v) or 0) >= 1,
+                ):
+                    override[ci] = "n" if family == "r" else "df"
     # Pass 5 — final est-adjacency sweep for tables with no caption run: a
     # still-blank bare-number column immediately left of a CI column is the
     # interval's point estimate.
@@ -1164,10 +1235,55 @@ def _detect_column_groups(
     starts = [i for i, h in enumerate(header) if _MERGE_SEPARATOR in (h or "")]
     if len(starts) < 2:
         return None
+    n = len(header)
+    # The sentinel marks where camelot PLACED each super-label — but a *centered*
+    # spanning label (colspan is lost in stream extraction) lands mid-span, not at
+    # its arm's first column, so trusting the sentinel as the arm boundary
+    # mis-bins columns: collabra.90203 T10 puts the Target-article "r" into the
+    # label region, and xiao_2021 T4 splits Original/Replication with the F values
+    # swapped. Re-derive arm boundaries from EQUAL-WIDTH blocks of the data region
+    # (the columns between the leading + trailing non-stat label columns), each of
+    # which must contain exactly one super-label. Falls back to the literal
+    # sentinel boundaries when the region does not divide evenly — so every
+    # previously-grouped table stays byte-identical unless this strictly corrects
+    # its alignment (a left-aligned super-header already at the block start yields
+    # the identical grouping). General, keyed on structure, not paper id. (DP-5.)
+    def _is_label_col(i: int) -> bool:
+        return i not in starts and not _classify_column(header[i])
+    lead = 0
+    while lead < n and _is_label_col(lead):
+        lead += 1
+    trail = n - 1
+    while trail >= 0 and _is_label_col(trail):
+        trail -= 1
+    width = trail - lead + 1
+    k = len(starts)
+    if (
+        width >= k
+        and width % k == 0
+        and lead <= starts[0]
+        and starts[-1] <= trail
+    ):
+        block = width // k
+        blocks = [(lead + j * block, lead + (j + 1) * block - 1) for j in range(k)]
+        if all(b_lo <= s <= b_hi for (b_lo, b_hi), s in zip(blocks, starts)):
+            label_cols = [i for i in range(n) if i < blocks[0][0] or i > blocks[-1][1]]
+            groups = [
+                (
+                    (header[s].split(_MERGE_SEPARATOR, 1)[0]).strip(),
+                    list(range(b_lo, b_hi + 1)),
+                )
+                for (b_lo, b_hi), s in zip(blocks, starts)
+            ]
+            return label_cols, groups
+    # Fallback: literal sentinel-boundary grouping (pre-existing behavior).
     label_cols = list(range(0, starts[0]))
-    groups: list[tuple[str, list[int]]] = []
+    groups = []
     for gi, start in enumerate(starts):
-        end = starts[gi + 1] if gi + 1 < len(starts) else len(header)
+        end = starts[gi + 1] if gi + 1 < len(starts) else n
         glabel = (header[start].split(_MERGE_SEPARATOR, 1)[0]).strip()
         groups.append((glabel, list(range(start, end))))
     return label_cols, groups
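
The equal-width block rule added to `_detect_column_groups` reduces to a small piece of index arithmetic, sketched here on its own. `arm_blocks` is a hypothetical name; it takes the already-computed `lead`, `trail`, and `starts` values and returns the inclusive column range of each arm, or `None` where the hunk would fall back to the literal sentinel boundaries. The example layout is invented.

```python
from typing import Optional


def arm_blocks(
    lead: int, trail: int, starts: list[int]
) -> Optional[list[tuple[int, int]]]:
    """Hypothetical helper: split the data region into one equal block per arm."""
    width = trail - lead + 1
    k = len(starts)
    # The region must divide evenly and contain every super-label position.
    if k < 2 or width < k or width % k != 0:
        return None
    if not (lead <= starts[0] and starts[-1] <= trail):
        return None
    block = width // k
    blocks = [(lead + j * block, lead + (j + 1) * block - 1) for j in range(k)]
    # Each block must hold its own super-label, in order.
    if all(lo <= s <= hi for (lo, hi), s in zip(blocks, starts)):
        return blocks
    return None


# Invented layout: column 0 is the row label, columns 1-8 hold two 4-column arms,
# and the centered super-labels landed on columns 2 and 6.
centered = arm_blocks(lead=1, trail=8, starts=[2, 6])
uneven = arm_blocks(lead=1, trail=7, starts=[2, 6])
```

In the invented layout the literal sentinel grouping would start the first arm at column 2 and treat column 1 as a label column. The block rule starts it at column 1, which is the correction the comment describes for centered super-headers.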

docpluck-2.4.97/docs/superpowers/handoffs/2026-06-22-dp2-dp5-flatten-fixes-commit.md ADDED

@@ -0,0 +1,71 @@
+# DP-2 + DP-5 flatten fixes — commit/coordination handoff (2026-06-22)
+## 1. Goal
+Commit the **DP-2 + DP-5 table-flatten fixes** (already complete + verified, working tree, v2.4.97) as a clean commit on top of the concurrent session's RC-T render-guard commit `84a4d42` (v2.4.96) — staging **only the 8 listed files** — without colliding with the parallel session that shares this working tree.
+## 2. Why it matters
+docpluck is a meta-science tool; a dropped or mis-bound table row silently corrupts downstream stat verification (effectcheck/ESCImate). DP-2 and DP-5 came from a real consumer handoff (`ESCIcheckapp/docs/DOCPLUCK_HANDOFF_2026-06-21.md`). **Two Claude sessions are editing this one working tree concurrently** (the other committed `84a4d42` mid-work), so the commit must stage explicit paths — a `git add -A`/`git add .` from either session would sweep the other's unfinished work into the wrong commit (memory `release-version-collision-with-parallel-uncommitted-stream`).
+## 3. State at handoff
+- Branch: `feat/rc-t-table-region-guard`
+- HEAD commit: `84a4d42` (`fix(render): RC-T — strip Camelot tables that are absorbed body prose (v2.4.96)`) — committed by the **concurrent session** (Gilad Feldman, 2026-06-22 07:56), not this one.
+- Committed in this session: **none** (this session did not commit, per the "commit only when asked" rule).
+- Uncommitted (this session's DP-2/DP-5 work, all on top of `84a4d42`):
+  - `docpluck/tables/flatten.py` — DP-2 Pass 4.5 (type blank `p`/`df`); DP-5 `_classify_column` sentinel-aware + `_detect_column_groups` equal-width-block arm alignment
+  - `docpluck/tables/cell_cleaning.py` — DP-5 `_DATA_VALUE_CELL_RE` + `_is_header_like_row` data-value recognition
+  - `docpluck/extract_structured.py` — `TABLE_EXTRACTION_VERSION` `2.4.0` → `2.4.1` (DP-2/DP-5 note)
+  - `docpluck/__init__.py` — `__version__` `2.4.96` → `2.4.97`
+  - `pyproject.toml` — `version` `2.4.96` → `2.4.97`
+  - `CHANGELOG.md` — new `[2.4.97]` entry
+  - `tests/test_tables_flatten_blank_header_recovery.py` — DP-2 tests added (`test_separate_arm_p_and_df`, `test_joint_arm_p_and_integer_df`, packed-arms p/df assertions)
+  - `tests/test_tables_superheader_alignment_real_pdf.py` — **new** (DP-5 real-PDF + contract tests)
+- Working artifacts (gitignored, safe to ignore/delete): `tmp/repro_dp.py`, `tmp/dbg_dp2.py`, `tmp/cache_and_flatten.py`, `tmp/flat_mine.json`, `tmp/flat_head.json`, `tmp/tblcache/*.json`.
+## 4. What's done (verified)
+- **DP-2 — 77859 Table 3 `fields` now include `p` + `df`.** `_recover_blank_roles` Pass 4.5 types the operator-less `.XXX` p column and the integer/Welch-decimal df column it previously skipped. Verified: the DP-2 tests **fail at HEAD** (`KeyError: 'p'`) and **pass** after; `flatten_table` on the live PDF yields `p=.551, df=260.54` for the Separate arm.
+- **DP-5 — 90203 Table 10 emits all 6 conditions, correctly arm-split.** The handoff blamed Camelot, but Camelot extracted all 6 rows; `flatten` dropped the first data row (mistaking it for a third header row) and then mis-bound the centered super-header. The fix has three coupled parts: header detection, block alignment, and folded-header classification. Verified: Table 10 → 12 rows (6 conditions × Target/Replication) with the **exact** handoff values (`r=.63, n=170, CI [0.53,0.72]` for Identifiable/Explicit); the rendered `.md` `<table>` shows all 6 rows; the real-PDF tests fail at HEAD and pass after.
+- **Incidental correct improvements** (same root cause): `xiao_2021` T4 Original/Replication F **un-swapped** (was wrong at HEAD — a canary), `chan_feldman` T8 arm labels recovered (canary), `jama_open_2` T3 HR estimates+CIs recovered.
+- **No clean-table regression**: full-corpus (101-PDF) **deterministic cached-table flatten diff** (mine vs HEAD) — every change is a recovered row, a correct arm split, a recovered field, or a removed stat-less spurious row; already-garbage tables (chen T9, aom amd_2, ieee T10) shuffle but no clean table regressed. 285 contract tests pass; both touched test files pass per-file (superheader 7/7, flatten_blank_header 27/27).
+## 5. What's next (numbered, concrete)
+1. **Coordinate with the concurrent session first.** Confirm the other session has finished writing to the working tree (or have it commit/stash its own files) so the two change-sets don't interleave. Both sets are on `84a4d42`; they touch disjoint files (theirs: `render.py`; mine: `flatten.py`/`cell_cleaning.py`), so they compose cleanly.
+2. **Commit DP-2 + DP-5 as v2.4.97, staging only these 8 files** (never `git add -A`):
+   ```bash
+   git add docpluck/__init__.py pyproject.toml docpluck/extract_structured.py \
+           docpluck/tables/cell_cleaning.py docpluck/tables/flatten.py \
+           tests/test_tables_flatten_blank_header_recovery.py \
+           tests/test_tables_superheader_alignment_real_pdf.py CHANGELOG.md
+   git commit -m "fix(tables): DP-2 type blank p/df + DP-5 two-header-row recovery & super-header alignment (v2.4.97)"
+   ```
+   (Optional: split into two commits — DP-2 = `flatten.py` Pass 4.5 + the `test_tables_flatten_blank_header_recovery.py` additions; DP-5 = `cell_cleaning.py` + the rest of `flatten.py` + the new test file — if independent revertability is wanted. The version-bump files go with whichever commit ships last.)
+3. **Before any `git tag v2.4.97` / release**: run the formal Sonnet canary AI-verify (the project's keystone gate — `references/ai-full-doc-verify.md`) on the touched canaries (`90203` maier, `xiao`, `chan_feldman`) against the article-finder golds, and on the 26-paper baseline. Tagging fires `bump-app-pin.yml`, so do not tag until that gate is green; after tagging, run `python scripts/check_app_pin_sync.py`.
+4. **Architectural backlog — leave as documented backlog** (user decision 2026-06-22). Do NOT start RC-T Layer-1 recovery or RC-1 default-flip this stream. The four remaining handoff defects map to existing tracked work (see §6 / §8).
+## 6. Open decisions
+- **Tag/release v2.4.97 now, or batch with later table work?** Options: (A) tag now — ships the fixes to the app via the pin bump, but each tag is a Railway redeploy; (B) leave committed-but-untagged and batch with the next table cycle. **Recommendation: (B)** — the concurrent session's v2.4.96 is also untagged on this branch; tag once when the branch's table work is consolidated, after the formal canary AI-verify. Confirm with the user.
+- **Commit granularity (1 vs 2 commits).** Recommendation: **1 commit** — DP-2 and DP-5 are both from the same handoff, ship together as v2.4.97, and were verified together; the CHANGELOG documents both. Split only if you specifically want per-defect revertability.
+## 7. Watchouts
+- **Shared working tree (live collision risk).** The other session committed `84a4d42` while this session worked. NEVER `git add -A`/`git add .` — stage the 8 explicit paths only. Verify `git status` shows nothing unexpected staged before committing.
+- **Real-PDF Camelot tests flake under *cumulative* load — even serially.** Running 13 Camelot-heavy test files in one `pytest` process flaked 9 ("no tables extracted"); each passes per-file. So the canonical `pytest tests/ -q` whole-suite run is unreliable for the table real-PDF tests — run them **per file** (or in small batches) to gate. This is a pre-existing infra issue (`test_rc_t_degenerate_table_real_pdf.py` docstring notes the xdist variant; this extends it to serial cumulative load). Not fixed here.
+- **DP-5 was misdiagnosed in the source handoff** ("Camelot drops a row" → actually a `flatten` header-miscount). Reproducing at HEAD before coding is what caught it (memory `reproduce-triage-defect-at-head-before-trusting-cost-estimate`). Apply the same to DP-1/DP-6 before assuming they're Layer-1.
+- **Version already bumped in the working tree** (`__version__`/`pyproject` → 2.4.97, `TABLE_EXTRACTION_VERSION` → 2.4.1). Don't double-bump.
+- **No formal Sonnet AI-gold verify has been run yet.** So far the only checks are the deterministic diff and the consumer handoff's AI-derived expected values, which my output matches exactly. The project rule is that AI-gold is the verdict, so run it before tagging (§5 step 3).
+- **`NORMALIZATION_VERSION` / `SECTIONING_VERSION` intentionally NOT bumped** — this change is table-flatten only.
+## 8. Context pointers
+- Source defect list (the 6 DP defects): `../../../../ESCIcheckapp/docs/DOCPLUCK_HANDOFF_2026-06-21.md`
+- Living work queue (RC-T / RC-1 architectural backlog = DP-1/3/4/6): `docs/TRIAGE_2026-06-21_head_v2.4.95_assessment.md`
+- RC-T spec (DP-1/DP-6 Layer-1 recovery is the out-of-scope follow-on): `docs/superpowers/specs/2026-06-21-rc-t-table-region-prose-contamination.md`
+- RC-1 spec (DP-3/DP-4 interleave; banded flag exists, default-OFF pending Step-2 polish): `docs/superpowers/specs/2026-06-08-rc1-region-aware-column-architecture.md`
+- This session's tests: `tests/test_tables_superheader_alignment_real_pdf.py`, `tests/test_tables_flatten_blank_header_recovery.py`
+- Release/AI-verify gate: `.claude/skills/docpluck-iterate/references/ai-full-doc-verify.md`; app-pin gate `scripts/check_app_pin_sync.py`
+- Memories: `feedback_docpluck_app_pin_sync` (verify origin/master before/after tag), `feedback_canary_audit_clobbers_phase5d` (don't trust AUDIT_DEFERRED PASS), `feedback_general_fixes_not_pdf_specific`, `release-version-collision-with-parallel-uncommitted-stream`
+### Architectural backlog (DP-1/3/4/6 — left as documented backlog per user decision 2026-06-22)
+| Defect | Paper | Maps to | Status |
+|---|---|---|---|
+| DP-1 | 77859 Table 1/2 not extracted (Camelot 0 cells) | RC-T **Layer-1** recovery (`table_areas`) | deferred — out-of-scope in RC-T spec |
+| DP-3 | 37122 figure-caption interleaved between stat & CI | RC-1 column interleave | deferred — banded flag exists, default-OFF |
+| DP-4 | cog_emo under-extraction (22 vs 47) | RC-T + RC-1 | partial (T6 prose-strip in `84a4d42`; T8 arm labels in v2.4.97); rest deferred |
+| DP-6 | 37122 results-summary table mashed into prose | RC-T Layer-1 / RC-1 | deferred |

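The handoff's "stage explicit paths only" rule can be checked mechanically before committing. The following is a minimal sketch, assuming the eight paths from the `git add` command in §5 step 2; `ALLOWED`, `unexpected_staged`, and `staged_paths` are hypothetical names introduced here and are not part of docpluck.

```python
import subprocess

# The eight paths the handoff assigns to the DP-2/DP-5 commit (from §5 step 2).
ALLOWED = frozenset({
    "docpluck/__init__.py",
    "pyproject.toml",
    "docpluck/extract_structured.py",
    "docpluck/tables/cell_cleaning.py",
    "docpluck/tables/flatten.py",
    "tests/test_tables_flatten_blank_header_recovery.py",
    "tests/test_tables_superheader_alignment_real_pdf.py",
    "CHANGELOG.md",
})


def unexpected_staged(staged, allowed=ALLOWED):
    """Return the staged paths that are not on the allow-list, sorted."""
    return sorted(set(staged) - set(allowed))


def staged_paths():
    """List the paths currently staged in the git index (hypothetical helper)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]
```

Calling `unexpected_staged(staged_paths())` inside the working tree right before `git commit` should return an empty list; any path it reports probably belongs to the other session and should be unstaged first.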
{docpluck-2.4.96 → docpluck-2.4.97}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "docpluck"
-version = "2.4.96"
+version = "2.4.97"
 description = "PDF, DOCX, and HTML text extraction and normalization for academic papers"
 readme = "docs/README.md"
 requires-python = ">=3.10"

docpluck-2.4.97/tests/test_rc_t_layer2_raw_text_real_pdf.py ADDED

@@ -0,0 +1,163 @@
+"""RC-T Layer-2 — raw_text-fallback prose contamination (v2.4.97, 2026-06-22).
+When Camelot recovers no cells, ``_extract_table_body_text`` linearizes the
+text following a table caption as the ``unstructured-table`` fallback. Its
+per-line prose gate (``_line_is_body_prose``, len>=80) cannot see body prose
+that pdftotext WRAPPED into short (~48-char) lines, so the region overshoot
+swallowed Results/Discussion prose into the block:
+  * chan_feldman Table 1 — Discussion prose ("Our main focus was the
+    replication …") accumulated AFTER the table's ``Note:`` footnote.
+  * chan_feldman Table 9 — the block was ENTIRELY flowing prose ("than
+    empathy. We provided full analyses …") duplicating the real ``##
+    Discussion`` section verbatim.
+Two structural-signature fixes (rule 16), both FP-safe by construction:
+  1. Note-anchor: a table's ``Note:`` is its last element — trim everything
+     after the note paragraph (T1).
+  2. Degenerate-prose guard: suppress a block that STARTS mid-sentence with a
+     lowercase multi-letter word AND is majority prose; render then emits a
+     clean caption-only table (T9).
+Contract tests pin the FP-safe predicate deterministically; real-PDF tests
+(rule 0d) confirm on chan_feldman. PDFs are closed-access
+(``feedback_no_pdfs_in_repo``); real-PDF tests skip when the fixture is absent.
+"""
+from __future__ import annotations
+import os
+import re
+from pathlib import Path
+import pytest
+from docpluck.extract_structured import (
+    _join_wrapped_lines,
+    _raw_text_is_degenerate_prose,
+)
+from docpluck.render import render_pdf_to_markdown
+from .conftest import pdf_available, pdf_path, requires_pdftotext
+_skip_under_xdist = pytest.mark.skipif(
+    bool(os.environ.get("PYTEST_XDIST_WORKER")),
+    reason="real-PDF Camelot extraction is non-deterministic under parallel "
+    "xdist load; runs serially (isolation/serial run is the real gate)",
+)
+# ── contract tests: the FP-safe degenerate-prose predicate (deterministic) ────
+# All-prose block that STARTS mid-sentence (lowercase multi-letter word) — the
+# region-overshoot signature. Must be flagged degenerate.
+_DEGENERATE = (
+    "than empathy. We provided full analyses and results\n"
+    "for the comparisons in the supplementary materials\n"
+    "section of this paper across all of the conditions.\n"
+    "We replicated all of the supported findings of the\n"
+    "target article and summarised the results below here."
+)
+# Hypotheses table (legit, degraded): starts with a single-letter item marker.
+_HYPOTHESES = (
+    "a There is a positive association between a wronged\n"
+    "person's empathy for an offender and reported\n"
+    "forgiveness for the offender.\n"
+    "b Apology increases the likelihood of forgiving."
+)
+# Descriptive rows (legit): starts with a Capitalized label.
+_DESCRIPTIVE = "Median age (years)\n24.0\nAverage age\n28.8\n(years)\nStandard deviation"
+# Instrument-table fragment (legit): starts with a single-letter token "h".
+_INSTRUMENT = "h et al., 1997)\nPerceived apology\nEmpathy\nThe offender has apologised?"
+def test_degenerate_prose_flagged():
+    assert _raw_text_is_degenerate_prose(_DEGENERATE) is True
+def test_hypotheses_not_flagged():
+    """Single-letter item marker ('a ...') => not a mid-sentence continuation."""
+    assert _raw_text_is_degenerate_prose(_HYPOTHESES) is False
+def test_descriptive_rows_not_flagged():
+    assert _raw_text_is_degenerate_prose(_DESCRIPTIVE) is False
+def test_instrument_fragment_not_flagged():
+    assert _raw_text_is_degenerate_prose(_INSTRUMENT) is False
+def test_short_block_not_flagged():
+    assert _raw_text_is_degenerate_prose("than empathy.\nWe provided.") is False
+def test_join_wrapped_lines_merges_to_sentence():
+    assert _join_wrapped_lines(["a foo", "bar baz.", "next one."]) == [
+        "a foo bar baz.",
+        "next one.",
+    ]
+# ── real-PDF tests (chan_feldman) ─────────────────────────────────────────────
+def _unstructured_blocks(md: str) -> str:
+    """Whitespace-normalized concatenation of every ```unstructured-table``` block."""
+    blocks = re.findall(r"```unstructured-table\n(.*?)```", md, re.DOTALL)
+    return re.sub(r"\s+", " ", "\n".join(blocks))
+@pytest.fixture(scope="module")
+def chan_md() -> str:
+    key = "10.1080__02699931.2024.2434156"
+    if not pdf_available("articlerepo", f"{key}.pdf"):
+        pytest.skip(f"closed-access fixture missing: {key}.pdf")
+    return render_pdf_to_markdown(Path(pdf_path("articlerepo", f"{key}.pdf")).read_bytes())
+@requires_pdftotext
+@_skip_under_xdist
+def test_t1_note_anchor_trims_trailing_prose(chan_md: str):
+    """Table 1: body prose after the ``Note:`` footnote must be trimmed from the
+    fallback block (FAIL at HEAD — it was swallowed)."""
+    blocks = _unstructured_blocks(chan_md)
+    assert "Our main focus was the replication" not in blocks, (
+        "chan_feldman T1 still swallows post-Note Discussion prose — the "
+        "Note-anchor trim in _extract_table_body_text did not fire."
+    )
+@requires_pdftotext
+@_skip_under_xdist
+def test_t1_table_content_and_note_retained(chan_md: str):
+    """FP guard: the Note-anchor must KEEP the table content + the note itself
+    (hypotheses come before the note; trimming starts after it)."""
+    blocks = _unstructured_blocks(chan_md)
+    assert "There is a positive association" in blocks, "T1 hypothesis content lost (over-trim)"
+    assert "Hypothesis 3 is not included in the replication" in blocks, "T1 Note paragraph lost (over-trim)"
+@requires_pdftotext
+@_skip_under_xdist
+def test_t9_degenerate_block_suppressed_no_duplication(chan_md: str):
+    """Table 9: the all-prose fallback (a verbatim duplicate of ## Discussion)
+    must be suppressed — the Discussion opener appears exactly once, never inside
+    an unstructured-table block."""
+    opener = "We conducted a replication and extensions Registered Report"
+    assert opener not in _unstructured_blocks(chan_md), (
+        "chan_feldman T9 still dumps Discussion prose into an unstructured-table "
+        "block — the degenerate-prose guard did not fire."
+    )
+    assert "### Table 9" in chan_md, "T9 heading lost (table_parity broken)"
+    n = len(re.findall(re.escape(opener), chan_md))
+    assert n == 1, f"Discussion opener appears {n}x (expected 1 — T9 duplication not resolved)"
+@requires_pdftotext
+@_skip_under_xdist
+def test_t3_legit_fallback_table_survives(chan_md: str):
+    """FP guard: Table 3 (a real descriptive table starting with a Capitalized
+    label) must keep its fallback block + its Note — never suppressed/over-trimmed."""
+    blocks = _unstructured_blocks(chan_md)
+    assert "Median age" in blocks, "chan_feldman T3 descriptive fallback wrongly suppressed (FP)"
+    assert "Origin was not explicitly mentioned" in blocks, "T3 Note over-trimmed (FP)"

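The degenerate-prose guard described in the test module's docstring (a block that starts mid-sentence with a lowercase multi-letter word and is majority prose) can be illustrated with a small stand-alone predicate. This is a sketch only: the diff does not show the body of `_raw_text_is_degenerate_prose`, so the function name `looks_like_degenerate_prose`, the minimum line count, and the per-line prose heuristic below are all assumptions chosen to match the contract tests, not docpluck's implementation.

```python
import re

# Assumed thresholds for this sketch; the real values are not shown in the diff.
MIN_LINES = 3
MIN_PROSE_WORDS = 4

# A lowercase word of two or more letters at the very start of the block.
_STARTS_MID_SENTENCE = re.compile(r"[a-z]{2,}\b")


def _is_prose_line(line: str) -> bool:
    """Treat a line as prose if it has several words and contains no digits."""
    return len(line.split()) >= MIN_PROSE_WORDS and not any(c.isdigit() for c in line)


def looks_like_degenerate_prose(raw_text: str) -> bool:
    """Flag a block that starts mid-sentence and is majority prose lines."""
    lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    if len(lines) < MIN_LINES:
        return False  # too short to judge; keep the block
    if not _STARTS_MID_SENTENCE.match(lines[0]):
        return False  # item markers ("a ...") and Capitalized labels are kept
    prose = sum(_is_prose_line(ln) for ln in lines)
    return prose * 2 > len(lines)
```

Requiring both signals is what keeps the check conservative: a hypotheses table that opens with a single-letter marker, or a descriptive table that opens with a capitalized label, fails the first test and is never suppressed, however prose-like its remaining lines are.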