PyPI - extract-cli - Versions diffs - 0.1.10__tar.gz → 0.1.12__tar.gz - Mend

extract-cli 0.1.10.tar.gz → 0.1.12.tar.gz


Files changed (47)

{extract_cli-0.1.10 → extract_cli-0.1.12}/ARCHITECTURE.md RENAMED

@@ -40,16 +40,23 @@ the "verify, not trust" contract downstream tools consume.
 ## The clause map
-`detect_clauses(text)` is a faithful port of template-vault-cli's three-tier
-cascade; the first tier that fires wins so fallbacks never shadow real
-structure:
+`detect_clauses(text)` extends template-vault-cli's clause cascade; the first
+tier that fires wins so fallbacks never shadow real structure:
-1. **`h2`** — `## Heading` (Markdown-native). Needs ≥ 1 match.
+1. **`h2`** — `## Heading` (Markdown-native; also what the DOCX reader emits for
+   Word heading styles / `w:numPr` paragraphs). Needs ≥ 1 match.
 2. **`bold-numbered`** — `**1. Purpose**`, `**Section 4. Term**` (typical of
    DOCX → text). Needs ≥ 2 matches.
-3. **`all-caps`** — blank-line-framed `CONFIDENTIALITY` lines (typical of legal
+3. **`numbered`** — plain `1. Term`, `Section 3. Payment`, and two-line
+   `ARTICLE N` + title (the dominant format in foreign paper), gated by a
+   title-case heuristic. Needs ≥ 2 matches.
+4. **`all-caps`** — blank-line-framed `CONFIDENTIALITY` lines (typical of legal
    PDFs), with the single-token-≥-4-letters rule. Needs ≥ 2 matches.
+(Plus an opt-in **`llm`** clause-map fallback under `--llm` when none of the
+above fire — see the LLM tier below.) After detection, running headers/footers
+and front/back-matter are filtered (`_is_noise_clause_title` + repeat dedup).
 `_strip_clause_number` removes leading numbering, including Roman numerals
 1–39 (`_ROMAN_RE` lists longer alternatives first so the engine doesn't
 short-circuit on a prefix — bare `V`/`X` match).
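The note about `_ROMAN_RE` listing longer alternatives first follows from how Python's `re` handles alternation: it tries alternatives left to right and keeps the first one that matches, not the longest. A minimal sketch of that behaviour, using two hypothetical patterns that are not the package's actual `_ROMAN_RE`:

```python
import re

# Hypothetical patterns for illustration only; NOT extract-cli's _ROMAN_RE.
SHORT_FIRST = re.compile(r"I|II|III")   # shorter alternatives listed first
LONG_FIRST = re.compile(r"III|II|I")    # longer alternatives listed first


def leading_numeral(pattern: re.Pattern, heading: str) -> str:
    """Return the text the pattern consumes at the start of the heading."""
    m = pattern.match(heading)
    return m.group(0) if m else ""


short = leading_numeral(SHORT_FIRST, "III. Term")
long_ = leading_numeral(LONG_FIRST, "III. Term")
print(repr(short), repr(long_))
```

With the short-first pattern only the first `I` is consumed, which would leave `II. Term` behind as the "title"; the long-first ordering consumes the whole numeral.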

{extract_cli-0.1.10 → extract_cli-0.1.12}/CHANGELOG.md RENAMED

@@ -6,6 +6,43 @@ to [Semantic Versioning](https://semver.org/). Per the suite convention
 (see [`docs/INTEROP.md`](docs/INTEROP.md)), **backward-incompatible changes to
 the output schema require a major version bump**; new optional fields are minor.
+## [0.1.12] - 2026-05-22
+### Security
+- **Fixed an XML entity-expansion ("billion laughs") vulnerability in `.docx`
+  parsing.** The 0.1.9 resource bounds only checked *size*, but a tiny
+  `word/document.xml` declaring a DTD with nested entities passes the size
+  check and then expands exponentially in the XML parser (both ElementTree and
+  lxml/python-docx resolve internal entities). A new `_docx_xml_guard` runs
+  before either reader and refuses any `document.xml` that declares a
+  DTD/entities (a legitimate OOXML part never does) — degrading gracefully to
+  empty text with a warning. Verified on both the stdlib and `[docx]` paths.
+## [0.1.11] - 2026-05-22
+Polish pass.
+### Fixed
+- **Signature blocks no longer capture the next column's label.** A two-column
+  unsigned block (`By:        By:` / `Name:      Name:`) used to yield garbage
+  signatories like `{"name": "By:", "title": "Title:"}`; such captures (and
+  blank fill lines) are now rejected, so an unsigned template correctly returns
+  no signatories.
+### Changed
+- **`extract fields` and `--format table` now surface `jurisdiction`,
+  `amounts`, and `signatories`** — they were extracted and in the JSON but not
+  discoverable via the catalog or the human table view. A drift-guard test now
+  asserts `extract fields` can't diverge from the output schema.
+- **Confidence values centralized into a documented scale** (named `CONF_*`
+  constants with a single descending ladder, replacing scattered magic numbers)
+  so downstream "verify, not trust" thresholds are principled. The only value
+  change: an affirmative auto-renewal is now `0.70` (was `0.65`), matching the
+  other best-effort term fields.
+- Docs sweep: refreshed the clause-cascade description (h2 → bold-numbered →
+  numbered → all-caps, + the `--llm` fallback) across README/ARCHITECTURE and
+  the output-shape example. Line coverage held at 100% (CI-gated).
 ## [0.1.10] - 2026-05-22
 ### Fixed
@@ -296,6 +333,8 @@ Initial release — the open-loop front door of the contract-ops CLI suite.
   intentionally *not* governed by the output schema (the schema describes the
   full default output).
+[0.1.12]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.12
+[0.1.11]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.11
 [0.1.10]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.10
 [0.1.9]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.9
 [0.1.8]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.8

{extract_cli-0.1.10 → extract_cli-0.1.12}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: extract-cli
-Version: 0.1.10
+Version: 0.1.12
 Summary: Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON.
 Project-URL: Homepage, https://cli.drbaher.com/
 Project-URL: Repository, https://github.com/DrBaher/extract-cli
@@ -180,16 +180,17 @@ Streams follow the suite convention: **stdout** is the machine payload (JSON),
   "value":      { "value": "$50,000", "confidence": 0.6, "source": "deterministic" },
   "amounts":    [ { "value": "$50,000", "confidence": 0.6, "source": "deterministic" } ],
   "signatories": [ { "name": "Jane Doe", "title": "CEO", "confidence": 0.55, "source": "deterministic" } ],
-  "_meta":      { "extractor_version": "0.1.9", "tiers_used": ["deterministic"], "llm_used": false }
+  "_meta":      { "extractor_version": "0.1.11", "tiers_used": ["deterministic"], "llm_used": false }
 }
 ```
 ## The clause map (the differentiator)
 A counterparty's "SECTION 7. NON-DISCLOSURE" and your template's
-"## Confidentiality" are the same clause. `extract-cli` reuses
-template-vault-cli's **clause-detection cascade** (Tier 1 `## H2` headings →
-Tier 2 bold-numbered `**1. …**` → Tier 3 ALL-CAPS lines) and a built-in
+"## Confidentiality" are the same clause. `extract-cli` extends
+template-vault-cli's **clause-detection cascade** — `## H2` headings →
+bold-numbered `**1. …**` → plain numbered (`1. Term`, `Section 3. …`, two-line
+`ARTICLE N`) → ALL-CAPS lines (and an opt-in `--llm` fallback) — plus a built-in
 **canonical alias vocabulary** to normalize foreign clause titles onto the
 names the rest of the suite already speaks. Clauses it can't map are kept with
 `mapped: false` (and a `*` in the table view) so nothing is silently dropped.

{extract_cli-0.1.10 → extract_cli-0.1.12}/README.md RENAMED

@@ -142,16 +142,17 @@ Streams follow the suite convention: **stdout** is the machine payload (JSON),
   "value":      { "value": "$50,000", "confidence": 0.6, "source": "deterministic" },
   "amounts":    [ { "value": "$50,000", "confidence": 0.6, "source": "deterministic" } ],
   "signatories": [ { "name": "Jane Doe", "title": "CEO", "confidence": 0.55, "source": "deterministic" } ],
-  "_meta":      { "extractor_version": "0.1.9", "tiers_used": ["deterministic"], "llm_used": false }
+  "_meta":      { "extractor_version": "0.1.11", "tiers_used": ["deterministic"], "llm_used": false }
 }
 ```
 ## The clause map (the differentiator)
 A counterparty's "SECTION 7. NON-DISCLOSURE" and your template's
-"## Confidentiality" are the same clause. `extract-cli` reuses
-template-vault-cli's **clause-detection cascade** (Tier 1 `## H2` headings →
-Tier 2 bold-numbered `**1. …**` → Tier 3 ALL-CAPS lines) and a built-in
+"## Confidentiality" are the same clause. `extract-cli` extends
+template-vault-cli's **clause-detection cascade** — `## H2` headings →
+bold-numbered `**1. …**` → plain numbered (`1. Term`, `Section 3. …`, two-line
+`ARTICLE N`) → ALL-CAPS lines (and an opt-in `--llm` fallback) — plus a built-in
 **canonical alias vocabulary** to normalize foreign clause titles onto the
 names the rest of the suite already speaks. Clauses it can't map are kept with
 `mapped: false` (and a `*` in the table view) so nothing is silently dropped.
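The "first tier that fires wins" cascade described in this README hunk can be sketched roughly as below. This is a simplified illustration, not the package's `detect_clauses()`: the regexes and the `detect_tier` helper are assumptions, and the title-case gate, two-line `ARTICLE N` headings, blank-line framing for ALL-CAPS lines, noise filtering and `--llm` fallback are all left out. Only the tier order and the minimum match counts (1, 2, 2, 2) come from the documentation in the diff.

```python
import re
from typing import List, Tuple

# Hypothetical, simplified tier patterns: (tier name, pattern, minimum matches).
TIERS: List[Tuple[str, re.Pattern, int]] = [
    ("h2", re.compile(r"^## +(.+)$", re.MULTILINE), 1),
    ("bold-numbered",
     re.compile(r"^\*\*(?:Section +)?\d+\. +(.+?)\*\*$", re.MULTILINE), 2),
    ("numbered",
     re.compile(r"^(?:Section +)?\d+\. +([A-Z].*)$", re.MULTILINE), 2),
    ("all-caps", re.compile(r"^([A-Z]{4,})$", re.MULTILINE), 2),
]


def detect_tier(text: str) -> Tuple[str, List[str]]:
    """Return the first tier whose match count reaches its minimum, plus titles."""
    for name, pattern, minimum in TIERS:
        titles = [m.group(1).strip() for m in pattern.finditer(text)]
        if len(titles) >= minimum:
            return name, titles  # earlier tiers shadow later fallbacks
    return "none", []


markdown_doc = "## Term\nbody\n\n1. Payment\n2. Notices\n"
plain_doc = "1. Term\nThe term is one year.\n2. Payment\nFees are due.\n"
caps_doc = "CONFIDENTIALITY\n\ntext\n\nTERMINATION\n"

print(detect_tier(markdown_doc))
print(detect_tier(plain_doc))
print(detect_tier(caps_doc))
```

In the first sample the numbered lines are ignored because the `h2` tier already fired, which is the point of ordering the tiers from most to least structural.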

{extract_cli-0.1.10 → extract_cli-0.1.12}/extract_cli.py RENAMED

@@ -43,11 +43,11 @@ import urllib.request
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
-__version__ = "0.1.10"
+__version__ = "0.1.12"
 # Bumped independently of the package version when the *extraction logic*
 # changes in a way downstream consumers should notice. Embedded in `_meta`.
-EXTRACTOR_VERSION = "0.1.10"
+EXTRACTOR_VERSION = "0.1.12"
 # JSON Schema version of the output contract (docs/spec/extract-output.schema.json).
 SCHEMA_VERSION = 1
@@ -503,6 +503,42 @@ def _none_field() -> JSON:
     return {"value": None, "confidence": 0.0, "source": "none"}
+# --- Confidence scale -------------------------------------------------------
+# These confidences are "verify, not trust" hints in [0, 1] -- a ranking of
+# *structural certainty*, not calibrated probabilities. Higher means the
+# extraction rests on more unambiguous structure; lower means a looser heuristic
+# or an LLM guess. Downstream tools threshold on them, so they are centralized
+# here and ordered into a single descending ladder rather than scattered as
+# magic numbers:
+#
+#   .95  explicit Markdown H2 heading
+#   .90  strong unambiguous pattern (parties "between X and Y"; labeled date)
+#   .85  clear keyword/structure (governing law; ISO date; bold-numbered heading)
+#   .80  keyworded but looser (plain numbered/ARTICLE heading; jurisdiction code)
+#   .75  structural-only heading (ALL-CAPS)
+#   .70  best-effort regex on common phrasing (term length, notice, auto-renew)
+#   .60  weak heuristic / LLM-enriched scalar (value, amounts, defined terms)
+#   .55  loose match (signature block, LLM obligations, non-ISO raw date)
+#   .50  fuzzy (LLM clause-map fallback)
+CONF_H2 = 0.95
+CONF_PARTIES = 0.90
+CONF_DATE_LABELED = 0.90
+CONF_DATE_ISO = 0.85
+CONF_GOVERNING_LAW = 0.85
+CONF_BOLD_HEADING = 0.85
+CONF_NUMBERED_HEADING = 0.80
+CONF_JURISDICTION = 0.80
+CONF_ALLCAPS_HEADING = 0.75
+CONF_TERM = 0.70
+CONF_WEAK = 0.60
+CONF_LLM = 0.60
+CONF_DATE_RAW = 0.55
+CONF_LLM_LIST = 0.55
+CONF_SIGNATORY = 0.55
+CONF_LLM_CLAUSE = 0.50
+CONF_UNMAPPED_FACTOR = 0.75  # multiplier applied to a clause that doesn't map to the vocabulary
 def _titlecase(s: str) -> str:
     s = s.strip()
     if not s:
@@ -675,7 +711,7 @@ def _date_field_from_str(raw: str, base_conf: float) -> JSON:
 def _date_field(match: Optional["re.Match[str]"]) -> JSON:
     if match is None:
         return _none_field()
-    return _date_field_from_str(match.group(1), 0.85)
+    return _date_field_from_str(match.group(1), CONF_DATE_ISO)
 # Trailing descriptors that follow a party's actual name and should be dropped
@@ -739,7 +775,7 @@ def extract_parties(text: str) -> List[JSON]:
         name, role = _split_name_role(raw)
         if not name or len(name) < 2 or len(name) > 120:
             continue
-        entry: JSON = {"name": name, "confidence": 0.9, "source": "deterministic"}
+        entry: JSON = {"name": name, "confidence": CONF_PARTIES, "source": "deterministic"}
         entry["role"] = role
         out.append(entry)
     return out
@@ -748,7 +784,7 @@ def extract_parties(text: str) -> List[JSON]:
 def extract_dates(text: str) -> JSON:
     label = _EFFDATE_LABEL_RE.search(text)
     if label is not None:
-        effective = _date_field_from_str(label.group(1), 0.9)
+        effective = _date_field_from_str(label.group(1), CONF_DATE_LABELED)
     else:
         effective = _date_field(_EFFECTIVE_RE.search(text))
     return {"effective": effective, "expiration": _date_field(_EXPIRE_RE.search(text))}
@@ -761,7 +797,7 @@ def extract_governing_law(text: str) -> JSON:
     juris = re.sub(r"\s+", " ", m.group(1).strip().rstrip(".,")).strip()
     if not juris:  # pragma: no cover - the capture group requires a leading letter
         return _none_field()
-    return _field(juris, 0.85)
+    return _field(juris, CONF_GOVERNING_LAW)
 def extract_term(text: str) -> JSON:
@@ -773,20 +809,20 @@ def extract_term(text: str) -> JSON:
         # Only emit when the captured token is a real number; otherwise the
         # match was a coincidence ("...consecutive days") -> leave as not-found.
         if num is not None:
-            length = _field(f"{num} {unit}{'s' if num != 1 else ''}", 0.7)
+            length = _field(f"{num} {unit}{'s' if num != 1 else ''}", CONF_TERM)
     notice = _none_field()
     nm = _NOTICE_RE.search(text)
     if nm:
         days = _word_to_int(nm.group(1))
         if days is not None:
-            notice = _field(days, 0.7)
+            notice = _field(days, CONF_TERM)
     auto = _none_field()
     if _AUTORENEW_NEG_RE.search(text):
-        auto = _field(False, 0.7)
+        auto = _field(False, CONF_TERM)
     elif _AUTORENEW_POS_RE.search(text):
-        auto = _field(True, 0.65)
+        auto = _field(True, CONF_TERM)
     return {"length": length, "auto_renew": auto, "notice_period_days": notice}
@@ -795,7 +831,7 @@ def extract_value(text: str) -> JSON:
     m = _MONEY_RE.search(text)
     if not m:
         return _none_field()
-    return _field(re.sub(r"\s+", " ", m.group(0).strip()), 0.6)
+    return _field(re.sub(r"\s+", " ", m.group(0).strip()), CONF_WEAK)
 def extract_amounts(text: str) -> List[JSON]:
@@ -807,7 +843,7 @@ def extract_amounts(text: str) -> List[JSON]:
         seen.setdefault(amt, None)
         if len(seen) >= 30:
             break
-    return [{"value": a, "confidence": 0.6, "source": "deterministic"} for a in seen]
+    return [{"value": a, "confidence": CONF_WEAK, "source": "deterministic"} for a in seen]
 # Signature blocks: "By: <name>", "Name: <name>", "Printed Name: <name>".
@@ -820,20 +856,32 @@ _SIG_TITLE_RE = re.compile(
     r"(?:^|\n)[ \t]*(?:Title|Its)[ \t]*:[ \t]*([^\n_{}\[\]]{2,60})",
     re.IGNORECASE,
 )
+# A captured value is rejected when it's really the next column's label (common
+# in two-column signature blocks: "By:    By:") or a blank fill line.
+_SIG_LABEL_RE = re.compile(r"(?:by|name|title|signature|its|date|signed|print)\b", re.IGNORECASE)
+def _clean_sig_value(raw: str) -> Optional[str]:
+    v = re.sub(r"\s+", " ", raw).strip(" .,:")
+    if (len(v) < 2 or v.lower() == "the"
+            or not any(c.isalpha() for c in v)
+            or _SIG_LABEL_RE.match(v)):
+        return None
+    return v
 def extract_signatories(text: str) -> List[JSON]:
     """Best-effort signature-block names (and titles, when adjacent). Skips
     unfilled placeholders. Blank on a template; populated on executed paper."""
-    titles = [re.sub(r"\s+", " ", m.group(1)).strip(" .,") for m in _SIG_TITLE_RE.finditer(text)]
+    titles = [_clean_sig_value(m.group(1)) for m in _SIG_TITLE_RE.finditer(text)]
     out: List[JSON] = []
     seen: Dict[str, None] = {}
     for i, m in enumerate(_SIGNATORY_RE.finditer(text)):
-        name = re.sub(r"\s+", " ", m.group(1)).strip(" .,")
-        if len(name) < 2 or name.lower() in ("the", "name", "title") or name in seen:
+        name = _clean_sig_value(m.group(1))
+        if name is None or name in seen:
             continue
         seen[name] = None
-        entry: JSON = {"name": name, "confidence": 0.55, "source": "deterministic"}
+        entry: JSON = {"name": name, "confidence": CONF_SIGNATORY, "source": "deterministic"}
         entry["title"] = titles[i] if i < len(titles) else None
         out.append(entry)
         if len(out) >= 12:
@@ -869,7 +917,7 @@ def extract_jurisdiction(governing_law: JSON) -> JSON:
             if len(name) >= 5 and name in key:
                 code = c
                 break
-    return _field(code, 0.8, "deterministic") if code else _none_field()
+    return _field(code, CONF_JURISDICTION, "deterministic") if code else _none_field()
 def extract_defined_terms(text: str) -> List[JSON]:
@@ -885,7 +933,7 @@ def extract_defined_terms(text: str) -> List[JSON]:
             seen.setdefault(term, None)
             if len(seen) >= 50:
                 break
-    return [{"term": t, "confidence": 0.6, "source": "deterministic"} for t in seen]
+    return [{"term": t, "confidence": CONF_WEAK, "source": "deterministic"} for t in seen]
 # Detected-heading titles that are almost never real clauses: front/back-matter,
@@ -936,9 +984,9 @@ def extract_clauses(text: str) -> List[JSON]:
             continue
         canonical, mapped = _canonicalize_clause(c["title"])
         tier = c["tier"]
-        base = {"h2": 0.95, "bold-numbered": 0.85, "numbered": 0.8,
-                "all-caps": 0.75, "explicit": 0.95}.get(tier, 0.7)
-        conf = round(base * (1.0 if mapped else 0.75), 2)
+        base = {"h2": CONF_H2, "bold-numbered": CONF_BOLD_HEADING, "numbered": CONF_NUMBERED_HEADING,
+                "all-caps": CONF_ALLCAPS_HEADING, "explicit": CONF_H2}.get(tier, CONF_TERM)
+        conf = round(base * (1.0 if mapped else CONF_UNMAPPED_FACTOR), 2)
         out.append({
             "canonical_title": canonical,
             "detected_title": c["detected"],
@@ -1062,6 +1110,32 @@ def _read_html(raw_text: str) -> str:
     return parser.get_text()
+def _docx_xml_guard(raw: bytes) -> Optional[str]:
+    """Run before EITHER docx reader on untrusted input. Returns a reason string
+    if word/document.xml is unsafe to parse, else None:
+      * decompresses past MAX_DECOMPRESSED_BYTES (zip bomb), or
+      * declares a DTD/entities -- a tiny 'billion laughs' part that passes the
+        size check but expands exponentially in the XML parser (ElementTree
+        *and* lxml/python-docx resolve internal entities). A legitimate OOXML
+        document.xml never declares one, so refusing is safe.
+    """
+    import io
+    import zipfile
+    try:
+        with zipfile.ZipFile(io.BytesIO(raw)) as z:
+            info = z.getinfo("word/document.xml")
+            if info.file_size > MAX_DECOMPRESSED_BYTES:
+                return (f"word/document.xml decompresses to {info.file_size} bytes "
+                        f"(> {MAX_DECOMPRESSED_BYTES} cap)")
+            with z.open("word/document.xml") as f:
+                head = f.read(65536)
+    except Exception:
+        return None  # not a valid zip / no document.xml -> let the readers report it
+    if re.search(rb"<!DOCTYPE|<!ENTITY", head, re.IGNORECASE):
+        return "document.xml declares a DTD/entities (XML-bomb guard)"
+    return None
 def _read_docx(path: Path, raw: bytes, prefer_optional: bool = True) -> Tuple[str, List[str]]:
     """Extract text from a .docx. Uses python-docx for higher fidelity when the
     optional [docx] extra is installed; otherwise a stdlib zipfile/XML reader
@@ -1070,6 +1144,10 @@ def _read_docx(path: Path, raw: bytes, prefer_optional: bool = True) -> Tuple[st
     `prefer_optional=False` forces the stdlib reader regardless of what's
     installed -- used to pin reproducible golden fixtures."""
     warnings: List[str] = []
+    unsafe = _docx_xml_guard(raw)
+    if unsafe is not None:
+        warnings.append(f"could not parse .docx ({unsafe}); treating as empty")
+        return "", warnings
     if prefer_optional and importlib.util.find_spec("docx") is not None:
         try:
             mod = importlib.import_module("docx")
@@ -1168,14 +1246,7 @@ def _read_docx_stdlib(raw: bytes) -> str:
     w = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
     with zipfile.ZipFile(io.BytesIO(raw)) as z:
-        # Zip-bomb guard: the uncompressed size is in the header, so check it
-        # before reading (don't decompress GBs into memory).
-        info = z.getinfo("word/document.xml")
-        if info.file_size > MAX_DECOMPRESSED_BYTES:
-            raise ValueError(
-                f"word/document.xml decompresses to {info.file_size} bytes "
-                f"(> {MAX_DECOMPRESSED_BYTES} cap)")
-        xml = z.read("word/document.xml")
+        xml = z.read("word/document.xml")  # size/XML-bomb already vetted by _docx_xml_guard
     root = ET.fromstring(xml)
     paras: List[str] = []
     # iter over w:p in document order (includes paragraphs inside table cells).
@@ -1572,7 +1643,7 @@ def _llm_clause_map(raw: Any, text: str) -> List[JSON]:
             "detected_title": title,
             "tier": "llm",
             "span": span,
-            "confidence": 0.5,
+            "confidence": CONF_LLM_CLAUSE,
             "source": "llm",
             "mapped": mapped,
         })
@@ -1607,18 +1678,18 @@ def llm_enrich(result: JSON, text: str, args_ns: argparse.Namespace) -> None:
     enriched = False
     rm = obj.get("renewal_mechanics")
     if isinstance(rm, str) and rm.strip():
-        result["term"]["renewal_mechanics"] = _field(rm.strip(), 0.6, "llm")
+        result["term"]["renewal_mechanics"] = _field(rm.strip(), CONF_LLM, "llm")
         enriched = True
     obligations = obj.get("obligations")
     if isinstance(obligations, list) and obligations:
         result["obligations"] = [
-            {"text": str(o).strip(), "confidence": 0.55, "source": "llm"}
+            {"text": str(o).strip(), "confidence": CONF_LLM_LIST, "source": "llm"}
             for o in obligations[:5] if str(o).strip()
         ]
         enriched = True
     gl = obj.get("governing_law")
     if isinstance(gl, str) and gl.strip() and result["governing_law"]["source"] == "none":
-        result["governing_law"] = _field(gl.strip(), 0.6, "llm")
+        result["governing_law"] = _field(gl.strip(), CONF_LLM, "llm")
         enriched = True
     if want_clauses:
         cmap = _llm_clause_map(obj.get("clauses"), text)
@@ -1707,10 +1778,20 @@ def render_table(result: JSON, no_confidence: bool) -> str:
             lines.append(f"  renewal     : {_fv(term['renewal_mechanics'])} {_dim('[llm]')}")
     if "governing_law" in result:
         lines.append(_bold("Governing law"))
-        lines.append(f"  {_fv(result['governing_law'])}")
+        juris = result.get("jurisdiction", {}).get("value")
+        suffix = _dim(f"  [{juris}]") if juris else ""
+        lines.append(f"  {_fv(result['governing_law'])}{suffix}")
     if "value" in result:
+        amts = result.get("amounts") or []
+        extra = _dim(f"  (+{len(amts) - 1} more)") if len(amts) > 1 else ""
         lines.append(_bold("Value"))
-        lines.append(f"  {_fv(result['value'])}")
+        lines.append(f"  {_fv(result['value'])}{extra}")
+    signatories = result.get("signatories")
+    if signatories:
+        lines.append(_bold(f"Signatories ({len(signatories)})"))
+        for s in signatories[:6]:
+            title = f" - {s['title']}" if s.get("title") else ""
+            lines.append(f"  {s['name']}{title}")
     clauses = result.get("clauses")
     if clauses is not None:
         lines.append(_bold(f"Clause map ({len(clauses)})"))
@@ -1935,11 +2016,14 @@ FIELD_CATALOG: Tuple[Tuple[str, str, str], ...] = (
     ("term.length", "deterministic", "Term length, best-effort"),
     ("term.notice_period_days", "deterministic", "Notice period in days, best-effort"),
     ("term.auto_renew", "deterministic", "Auto-renewal flag, best-effort"),
-    ("governing_law", "deterministic", "Governing law / jurisdiction"),
+    ("governing_law", "deterministic", "Governing law text ('governed by the laws of ...')"),
+    ("jurisdiction", "deterministic", "Governing law normalized to a code (e.g. US-DE)"),
     ("clauses", "deterministic", "Clause map normalized to the suite's canonical vocabulary "
                                  "(LLM fallback under --llm when no headings are detected)"),
     ("defined_terms", "deterministic", "Defined-term inventory (quoted / parenthetical)"),
     ("value", "deterministic", "Headline monetary value"),
+    ("amounts", "deterministic", "All distinct monetary amounts"),
+    ("signatories", "deterministic", "Signature-block names/titles (By:/Name:/Title:)"),
     ("term.renewal_mechanics", "llm", "Renewal mechanics (fuzzy; --llm only)"),
     ("obligations", "llm", "Key obligation phrasing (fuzzy; --llm only)"),
 )
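Since the confidence ladder exists so that downstream tools can threshold on it, here is a minimal sketch of how a consumer might do that. `needs_review` and `clause_confidence` are hypothetical helpers, not part of extract-cli; the two constants copy values shown in the diff, and the sample payload is made up to match the documented output shape. `clause_confidence` mirrors the `round(base * factor, 2)` arithmetic in `extract_clauses`.

```python
from typing import Any, List

CONF_TERM = 0.70             # value taken from the ladder in the diff
CONF_UNMAPPED_FACTOR = 0.75  # value taken from the diff


def needs_review(result: Any, threshold: float = CONF_TERM) -> List[str]:
    """Hypothetical helper: paths of found fields whose confidence < threshold."""
    flagged: List[str] = []

    def walk(node: Any, path: str) -> None:
        if isinstance(node, dict):
            if "confidence" in node and "source" in node:
                # Skip not-found fields (source "none"); flag weak found ones.
                if node["source"] != "none" and node["confidence"] < threshold:
                    flagged.append(path)
                return
            for key, child in node.items():
                if key != "_meta":
                    walk(child, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            for i, child in enumerate(node):
                walk(child, f"{path}[{i}]")

    walk(result, "")
    return flagged


def clause_confidence(base: float, mapped: bool) -> float:
    """Same arithmetic as the diff: unmapped clauses are scaled down."""
    return round(base * (1.0 if mapped else CONF_UNMAPPED_FACTOR), 2)


sample = {  # illustrative payload, not real extractor output
    "governing_law": {"value": "Delaware", "confidence": 0.85, "source": "deterministic"},
    "value": {"value": "$50,000", "confidence": 0.6, "source": "deterministic"},
    "term": {
        "auto_renew": {"value": True, "confidence": 0.7, "source": "deterministic"},
        "length": {"value": None, "confidence": 0.0, "source": "none"},
    },
    "signatories": [
        {"name": "Jane Doe", "title": "CEO", "confidence": 0.55, "source": "deterministic"}
    ],
    "_meta": {"extractor_version": "0.1.12"},
}

flagged = needs_review(sample)
print(flagged)
print(clause_confidence(0.95, mapped=False), clause_confidence(0.85, mapped=False))
```

With the default threshold, the affirmative auto-renewal at `0.70` is no longer flagged, whereas the previous `0.65` would have been; an unmapped H2 clause drops from `0.95` to `0.71`.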

{extract_cli-0.1.10 → extract_cli-0.1.12}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "extract-cli"
-version = "0.1.10"
+version = "0.1.12"
 description = "Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON."
 readme = "README.md"
 requires-python = ">=3.9"

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/fixtures/employment_docx.docx.expected.json RENAMED

@@ -39,7 +39,7 @@
     },
     "auto_renew": {
       "value": true,
-      "confidence": 0.65,
+      "confidence": 0.7,
       "source": "deterministic"
     },
     "notice_period_days": {
@@ -151,7 +151,7 @@
   ],
   "signatories": [],
   "_meta": {
-    "extractor_version": "0.1.10",
+    "extractor_version": "0.1.12",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/fixtures/heading_docx.docx.expected.json RENAMED

@@ -39,7 +39,7 @@
     },
     "auto_renew": {
       "value": true,
-      "confidence": 0.65,
+      "confidence": 0.7,
       "source": "deterministic"
     },
     "notice_period_days": {
@@ -140,7 +140,7 @@
   "amounts": [],
   "signatories": [],
   "_meta": {
-    "extractor_version": "0.1.10",
+    "extractor_version": "0.1.12",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/fixtures/lease_allcaps.txt.expected.json RENAMED

@@ -146,7 +146,7 @@
   ],
   "signatories": [],
   "_meta": {
-    "extractor_version": "0.1.10",
+    "extractor_version": "0.1.12",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/fixtures/license_pdf.pdf.expected.json RENAMED

@@ -146,7 +146,7 @@
   ],
   "signatories": [],
   "_meta": {
-    "extractor_version": "0.1.10",
+    "extractor_version": "0.1.12",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/fixtures/nda_h2.md.expected.json RENAMED

@@ -39,7 +39,7 @@
     },
     "auto_renew": {
       "value": true,
-      "confidence": 0.65,
+      "confidence": 0.7,
       "source": "deterministic"
     },
     "notice_period_days": {
@@ -150,7 +150,7 @@
   "amounts": [],
   "signatories": [],
   "_meta": {
-    "extractor_version": "0.1.10",
+    "extractor_version": "0.1.12",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/fixtures/numbered_docx.docx.expected.json RENAMED

@@ -140,7 +140,7 @@
   "amounts": [],
   "signatories": [],
   "_meta": {
-    "extractor_version": "0.1.10",
+    "extractor_version": "0.1.12",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/fixtures/scanned.pdf.expected.json RENAMED

@@ -55,7 +55,7 @@
   "amounts": [],
   "signatories": [],
   "_meta": {
-    "extractor_version": "0.1.10",
+    "extractor_version": "0.1.12",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/fixtures/services_bold.txt.expected.json RENAMED

@@ -146,7 +146,7 @@
   ],
   "signatories": [],
   "_meta": {
-    "extractor_version": "0.1.10",
+    "extractor_version": "0.1.12",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/fixtures/services_html.html.expected.json RENAMED

@@ -39,7 +39,7 @@
     },
     "auto_renew": {
       "value": true,
-      "confidence": 0.65,
+      "confidence": 0.7,
       "source": "deterministic"
     },
     "notice_period_days": {
@@ -161,7 +161,7 @@
   ],
   "signatories": [],
   "_meta": {
-    "extractor_version": "0.1.10",
+    "extractor_version": "0.1.12",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/test_coverage.py RENAMED

@@ -211,6 +211,19 @@ def test_render_table_unmapped_legend() -> None:
     assert "* = not mapped" in ex.render_table(r, no_confidence=False)
+def test_render_table_jurisdiction_amounts_signatories() -> None:
+    r = ex.build_extraction("body", b"x", "markdown", "x.md")
+    r["jurisdiction"] = ex._field("US-DE", ex.CONF_JURISDICTION)
+    r["amounts"] = [{"value": "$1", "confidence": 0.6, "source": "deterministic"},
+                    {"value": "$2", "confidence": 0.6, "source": "deterministic"}]
+    r["signatories"] = [{"name": "Jane Doe", "title": "CEO",
+                         "confidence": ex.CONF_SIGNATORY, "source": "deterministic"}]
+    table = ex.render_table(r, no_confidence=False)
+    assert "US-DE" in table
+    assert "+1 more" in table
+    assert "Signatories (1)" in table and "Jane Doe - CEO" in table
 def test_cli_silent_table_suppresses_human_view(capsys: pytest.CaptureFixture[str]) -> None:
     assert ex.main([str(FIXTURES / "nda_h2.md"), "--silent", "--format", "table"]) == 0
     assert "Clause map" not in capsys.readouterr().out

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/test_deterministic.py RENAMED

@@ -165,6 +165,13 @@ def test_signatories() -> None:
     assert ex.extract_signatories("Name: {party_1_signatory}\nBy: _____________") == []
+def test_signatories_two_column_blank_block() -> None:
+    # An unsigned two-column block ("By:   By:") must NOT capture the next
+    # column's label as a name.
+    text = "By:        By:\nName:      Name:\nTitle:     Title:\n"
+    assert ex.extract_signatories(text) == []
 def test_value_money() -> None:
     assert ex.extract_value("a fee of $250,000 is due")["value"] == "$250,000"
     assert ex.extract_value("budget is USD 1.5 million")["value"].startswith("USD")

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/test_misc.py RENAMED

@@ -277,6 +277,30 @@ def test_docx_zip_bomb_guard(tmp_path: Any) -> None:
     assert any("decompress" in w for w in warnings)
+def test_docx_xml_entity_bomb_refused(tmp_path: Any) -> None:
+    # A tiny 'billion laughs' document.xml passes the size check but would expand
+    # exponentially in the XML parser; the DTD/entity guard refuses it.
+    import io
+    import zipfile
+    w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
+    bomb = (
+        '<?xml version="1.0"?>\n'
+        '<!DOCTYPE r [<!ENTITY a "AAAA"><!ENTITY b "&a;&a;&a;&a;">]>\n'
+        f'<w:document xmlns:w="{w}"><w:body><w:p><w:r><w:t>&b;</w:t></w:r>'
+        '</w:p></w:body></w:document>'
+    ).encode()
+    buf = io.BytesIO()
+    with zipfile.ZipFile(buf, "w") as z:
+        z.writestr("[Content_Types].xml", "<Types/>")
+        z.writestr("word/document.xml", bomb)
+    p = tmp_path / "xmlbomb.docx"
+    p.write_bytes(buf.getvalue())
+    assert p.stat().st_size < 100_000  # tiny on disk
+    raw, text, fmt, warnings = ex.load_source(p)  # default reader path
+    assert fmt == "docx" and text == ""
+    assert any("DTD/entities" in w for w in warnings)
 def test_numbered_docx_clauses() -> None:
     """A DOCX whose clauses are w:numPr list paragraphs (no heading style, no
     visible number) still yields a clause map; a deep numbered body sentence is

{extract_cli-0.1.10 → extract_cli-0.1.12}/tests/test_schema_conformance.py RENAMED

@@ -56,6 +56,20 @@ def test_schema_command_emits_committed_spec() -> None:
     assert json.loads(proc.stdout) == json.loads(SPEC_FILE.read_text(encoding="utf-8"))
+def test_fields_catalog_covers_schema() -> None:
+    """`extract fields` (FIELD_CATALOG) must not silently drift from the output
+    schema -- every top-level output field appears in the catalog."""
+    schema_top = set(SCHEMA["properties"]) - {"_meta"}
+    catalog_prefixes = {f.split(".")[0] for f, _tier, _desc in ex.FIELD_CATALOG}
+    assert schema_top - catalog_prefixes == set()
+def test_confidence_scale_is_a_descending_ladder() -> None:
+    assert ex.CONF_H2 >= ex.CONF_PARTIES >= ex.CONF_GOVERNING_LAW >= ex.CONF_NUMBERED_HEADING
+    assert ex.CONF_ALLCAPS_HEADING >= ex.CONF_TERM >= ex.CONF_WEAK >= ex.CONF_LLM_CLAUSE
+    assert 0.0 < ex.CONF_LLM_CLAUSE and ex.CONF_H2 <= 1.0
 def test_schema_is_self_describing() -> None:
     assert SCHEMA["$schema"] == "https://json-schema.org/draft/2020-12/schema"
     assert "extract-cli" in SCHEMA["title"]