extract-cli 0.1.1 (tar.gz) → 0.1.3 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (40)

{extract_cli-0.1.1 → extract_cli-0.1.3}/ARCHITECTURE.md RENAMED

@@ -8,11 +8,14 @@ map.
 ```
 load_source(path)                       extension/content sniff → reader
   ├─ .md/.txt  → utf-8 decode
+  ├─ .html     → stdlib html.parser reader (also auto-detected inside .txt)
   ├─ .docx     → python-docx (if [docx]) else stdlib zipfile/XML reader
   └─ .pdf      → pypdf (if [pdf]) else stdlib zlib + text-operator reader
         │
         ▼  (raw_bytes, text, format, warnings)
 build_extraction(text, raw, fmt, src)   the DETERMINISTIC tier (always on)
+  │  field extractors run on a whitespace-FLATTENED copy (so values that wrap
+  │  across a line are matched whole); clause detection keeps the original text
   ├─ extract_parties          "between X and Y", with role parentheticals
   ├─ extract_dates            effective / expiration, ISO-normalized
   ├─ extract_term             length / auto_renew / notice_period_days

{extract_cli-0.1.1 → extract_cli-0.1.3}/CHANGELOG.md RENAMED

@@ -6,6 +6,64 @@ to [Semantic Versioning](https://semver.org/). Per the suite convention
 (see [`docs/INTEROP.md`](docs/INTEROP.md)), **backward-incompatible changes to
 the output schema require a major version bump**; new optional fields are minor.
+## [0.1.3] - 2026-05-21
+Clause-map de-noising and party cleanup, driven by testing against 10 more
+contracts (SEC EDGAR credit, loan, employment, lease, asset-purchase, and
+consulting HTML exhibits; Apache PDFs).
+### Fixed
+- **Clause map drops structural noise** common in dense real documents:
+  a heading whose title repeats 3+ times is treated as a running header/footer
+  (one lease's `Ks 112708-2` page code went from 44 "clauses" to 0), and
+  front/back-matter (`Table of Contents`, `Exhibit B`, `Schedule 2.1`) and
+  document codes/page numbers (4+ consecutive digits) are filtered out.
+- **Party-name cleanup** extended: trailing `together with …`, `, as
+  administrative agent`, and a dangling unclosed parenthetical
+  (`(each of them being`) are trimmed.
+### Notes
+- On dense documents the deterministic clause map can still surface a few
+  non-clause headings (e.g. address lines in a notices block); consumers
+  wanting only suite-vocabulary clauses should filter on `mapped == true`,
+  which isolates the real clauses (the noise is always `mapped == false`).
+- Known best-effort edge cases on varied real paper: a bare role word as a
+  party name ("Landlord"), and a middle-initial period truncating a personal
+  name ("John C." → "John C"). Best-effort fields carry confidence/source.
+## [0.1.2] - 2026-05-21
+More real-world hardening, driven by testing against five additional contracts
+(SEC EDGAR consulting/MSA, lease, and Visteon services agreements; Common Paper
+and Perigon Cloud Service Agreements).
+### Added
+- **HTML input** (`.html`/`.htm`, and HTML auto-detected inside `.txt` such as
+  SEC EDGAR full submissions). Stdlib `html.parser`-based reader strips
+  script/style, frames block elements so heading detection still works, and
+  unescapes entities. `document.format` enum gains `html` (backward-compatible
+  widening). This turns the large class of HTML contracts (SEC exhibits, web
+  ToS) from garbage into structured output.
+### Fixed
+- **Field extraction now runs on whitespace-flattened text**, so values that
+  wrap across a line break are matched whole — e.g. governing law
+  `the laws of the Province\nof Ontario` now yields `Province of Ontario`, and
+  line-wrapped party names/defined terms are captured.
+- **Party extraction** (continues issue #2): names are trimmed of trailing
+  descriptors (`, a Delaware corporation`, `doing business as …`,
+  `having its offices at …`, `as of …`), and each party must begin with a
+  capital so an `and` *inside* a party's own description no longer splits the
+  parties (`…V6E 3S7 and doing business as …` → real parties recovered).
+### Known limitations (documented, not bugs)
+- The stdlib PDF reader cannot decode PDFs that use embedded subset fonts with
+  hex-encoded glyph strings (common in professionally-typeset PDFs); these
+  degrade gracefully to a low-signal warning. Install the `[pdf]` extra (pypdf)
+  for them — verified to recover full text and clause structure.
+- Two-line `ARTICLE N` / title headings (number on one line, title on the next)
+  are not yet detected.
 ## [0.1.1] - 2026-05-21
 Real-world hardening, driven by testing against a SEC EDGAR employment
@@ -82,5 +140,7 @@ Initial release — the open-loop front door of the contract-ops CLI suite.
   intentionally *not* governed by the output schema (the schema describes the
   full default output).
+[0.1.3]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.3
+[0.1.2]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.2
 [0.1.1]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.1
 [0.1.0]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.0
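
The 0.1.3 entry above describes dropping any heading whose title repeats three or more times, on the theory that it is a running header or footer. The idea can be sketched in a few lines. This is an illustration only: `drop_repeated_headings` and its `norm` helper are hypothetical names, and `norm` is a stand-in for the package's own `_norm_clause_key`, whose exact behavior is not shown in this diff.

```python
from collections import Counter


def drop_repeated_headings(titles, threshold=3):
    """Drop every title whose normalized form occurs `threshold` or more times."""

    def norm(title):
        # Hypothetical normalizer: lowercase and collapse whitespace.
        return " ".join(title.lower().split())

    counts = Counter(norm(t) for t in titles)
    # Every occurrence of a repeated title is removed, not just the extras.
    return [t for t in titles if counts[norm(t)] < threshold]


titles = ["Confidentiality", "Ks 99-2", "Term", "Ks 99-2", "Ks 99-2"]
print(drop_repeated_headings(titles))  # ['Confidentiality', 'Term']
```

A title that appears only twice is kept, which matches the "3+" wording of the changelog.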

{extract_cli-0.1.1 → extract_cli-0.1.3}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: extract-cli
-Version: 0.1.1
-Summary: Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.docx/.pdf) and emit structured JSON.
+Version: 0.1.3
+Summary: Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON.
 Project-URL: Homepage, https://cli.drbaher.com/
 Project-URL: Repository, https://github.com/DrBaher/extract-cli
 Project-URL: Suite interop, https://github.com/DrBaher/extract-cli/blob/main/docs/INTEROP.md
@@ -63,8 +63,8 @@ ingest (extract) → review → diff → convert → sign
 ## What it does
-Give it a contract in **`.md` / `.txt`** (native), **`.docx`**, or **`.pdf`**,
-and it returns structured JSON: the parties, dates, term, governing law, a
+Give it a contract in **`.md` / `.txt` / `.html`** (native), **`.docx`**, or
+**`.pdf`**, and it returns structured JSON: the parties, dates, term, governing law, a
 **clause map** normalized onto the suite's canonical clause vocabulary, a
 defined-term inventory, and a headline value. Every field carries a
 `confidence` and a `source` so downstream tools **verify, don't trust**.
@@ -75,14 +75,15 @@ daemon, no network in the default path.
 ## Install
 ```bash
-pip install extract-cli                 # core: .md/.txt + best-effort .docx/.pdf
+pip install extract-cli                 # core: .md/.txt/.html + best-effort .docx/.pdf
 pip install "extract-cli[docx]"         # higher-fidelity .docx (python-docx)
 pip install "extract-cli[pdf]"          # higher-fidelity .pdf (pypdf)
 pip install "extract-cli[docx,pdf]"     # both
 ```
-The core has **zero runtime dependencies** and is fully functional on `.md`/`.txt`
-with no extras. `.docx` and `.pdf` work out of the box via stdlib readers; the
+The core has **zero runtime dependencies** and is fully functional on
+`.md`/`.txt`/`.html` with no extras (HTML is also auto-detected when it hides
+inside a `.txt`, e.g. SEC EDGAR filings). `.docx` and `.pdf` work out of the box via stdlib readers; the
 `[docx]`/`[pdf]` extras improve fidelity on complex documents (see
 [ARCHITECTURE.md](ARCHITECTURE.md)).

{extract_cli-0.1.1 → extract_cli-0.1.3}/README.md RENAMED

@@ -25,8 +25,8 @@ ingest (extract) → review → diff → convert → sign
 ## What it does
-Give it a contract in **`.md` / `.txt`** (native), **`.docx`**, or **`.pdf`**,
-and it returns structured JSON: the parties, dates, term, governing law, a
+Give it a contract in **`.md` / `.txt` / `.html`** (native), **`.docx`**, or
+**`.pdf`**, and it returns structured JSON: the parties, dates, term, governing law, a
 **clause map** normalized onto the suite's canonical clause vocabulary, a
 defined-term inventory, and a headline value. Every field carries a
 `confidence` and a `source` so downstream tools **verify, don't trust**.
@@ -37,14 +37,15 @@ daemon, no network in the default path.
 ## Install
 ```bash
-pip install extract-cli                 # core: .md/.txt + best-effort .docx/.pdf
+pip install extract-cli                 # core: .md/.txt/.html + best-effort .docx/.pdf
 pip install "extract-cli[docx]"         # higher-fidelity .docx (python-docx)
 pip install "extract-cli[pdf]"          # higher-fidelity .pdf (pypdf)
 pip install "extract-cli[docx,pdf]"     # both
 ```
-The core has **zero runtime dependencies** and is fully functional on `.md`/`.txt`
-with no extras. `.docx` and `.pdf` work out of the box via stdlib readers; the
+The core has **zero runtime dependencies** and is fully functional on
+`.md`/`.txt`/`.html` with no extras (HTML is also auto-detected when it hides
+inside a `.txt`, e.g. SEC EDGAR filings). `.docx` and `.pdf` work out of the box via stdlib readers; the
 `[docx]`/`[pdf]` extras improve fidelity on complex documents (see
 [ARCHITECTURE.md](ARCHITECTURE.md)).

{extract_cli-0.1.1 → extract_cli-0.1.3}/docs/spec/extract-output.schema.json RENAMED

@@ -69,7 +69,8 @@
             "markdown",
             "text",
             "docx",
-            "pdf"
+            "pdf",
+            "html"
           ]
         },
         "sha256": {

{extract_cli-0.1.1 → extract_cli-0.1.3}/extract_cli.py RENAMED

@@ -4,8 +4,8 @@
 The suite is a contract lifecycle (store -> draft -> review -> diff -> convert
 -> sign) that, until now, only handled documents it authored from its own
 templates. `extract-cli` is "passport control": it ingests ANY document --
-yours or a counterparty's foreign paper -- in .md/.txt (natively), .docx, or
-.pdf, and emits a structured JSON representation that the rest of the suite
+yours or a counterparty's foreign paper -- in .md/.txt/.html (natively), .docx,
+or .pdf, and emits a structured JSON representation that the rest of the suite
 (nda-review-cli, compare-cli, contract-vault) consumes.
 Two extraction tiers:
@@ -32,6 +32,7 @@ from __future__ import annotations
 import argparse
 import datetime as _dt
 import hashlib
+import html.parser
 import importlib.util
 import json
 import os
@@ -42,11 +43,11 @@ import urllib.request
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
-__version__ = "0.1.1"
+__version__ = "0.1.3"
 # Bumped independently of the package version when the *extraction logic*
 # changes in a way downstream consumers should notice. Embedded in `_meta`.
-EXTRACTOR_VERSION = "0.1.1"
+EXTRACTOR_VERSION = "0.1.3"
 # JSON Schema version of the output contract (docs/spec/extract-output.schema.json).
 SCHEMA_VERSION = 1
@@ -492,10 +493,17 @@ _EXPIRE_RE = re.compile(
     re.IGNORECASE,
 )
+# Each party must start with a capital letter (optionally "the X"), a quote, or
+# a paren. This is case-sensitive on purpose (no global IGNORECASE -- only the
+# keywords are): it lets the engine skip an "and" that sits INSIDE a party's own
+# description ("...V6E 3S7 and doing business as ...", where the right side
+# starts lowercase) and find the real "and" before the second named entity.
+_PARTY_START = r"(?:(?:[Tt]he|its)\s+)?[A-Z\"“(]"
 _PARTY_BLOCK_RE = re.compile(
-    r"\b(?:by\s+and\s+between|between)\s+(.{2,200}?)\s+\band\b\s+(.{2,200}?)"
-    r"(?=[\.;\n]|\bwhereas\b|\beffective\b|\bdated\b|\bhaving\b|\bwith\s+offices\b|$)",
-    re.IGNORECASE | re.DOTALL,
+    r"(?i:\b(?:by\s+and\s+between|between)\s+)"
+    r"(" + _PARTY_START + r"[^\n]{1,200}?)\s+and\s+"
+    r"(" + _PARTY_START + r"[^\n]{1,200}?)"
+    r"(?=[\.;\n]|(?i:\bwhereas\b|\beffective\b|\bdated\b|\bas\s+of\b|\bwitnesseth\b)|$)",
 )
 _ROLE_PAREN_RE = re.compile(
     r"\(\s*(?:the\s+)?[\"“]?([^\"”()]+?)[\"”]?\s*\)"
@@ -604,8 +612,47 @@ def _date_field(match: Optional["re.Match[str]"]) -> JSON:
     return _date_field_from_str(match.group(1), 0.85)
+# Trailing descriptors that follow a party's actual name and should be dropped
+# ("Acme Corp., a Delaware corporation", "... doing business as Foo", "... as of
+# March 1", "... having its offices at ..."). Each is matched and everything from
+# it onward is cut.
+_PARTY_CUT_MARKERS: Tuple[str, ...] = (
+    r",\s+an?\s+\w",                                  # ", a Delaware ..." / ", an Ohio ..."
+    r"\s+doing\s+business\s+as\b",
+    r"\s+d/?b/?a\b",
+    r"\s+f/?k/?a\b",
+    r"\s+a[n]?\s+\w+\s+(?:corporation|company|partnership|limited)\b",
+    r"\s+having\b",
+    r"\s+with\s+(?:its\s+)?(?:offices|principal|a\s)\b",
+    r"\s+with\s+offices\b",
+    r"\s+located\b",
+    r"\s+organized\b",
+    r"\s+incorporated\b",
+    r"\s+whose\b",
+    r"\s+together\b",
+    r",\s+as\s+\w",                                   # ", as administrative agent"
+    r"\s+(?:as\s+of|dated|effective)\b",
+)
+def _clean_party_name(s: str) -> str:
+    """Trim a captured party name down to the entity name, dropping trailing
+    descriptors ('a Delaware corporation', 'd/b/a ...', 'together with ...',
+    'as of ...') and any dangling unclosed parenthetical ('(each of them ...')."""
+    s = re.sub(r"\s+", " ", s).strip().strip(",").strip()
+    for pat in _PARTY_CUT_MARKERS:
+        m = re.search(pat, s, re.IGNORECASE)
+        if m:
+            s = s[: m.start()].strip().strip(",").strip()
+    # Drop a trailing parenthetical that was opened but never closed (the close
+    # fell outside the captured span), e.g. "Glenn Rufrano (each of them being".
+    if "(" in s and ")" not in s:
+        s = s[: s.index("(")].strip().strip(",").strip()
+    return s.strip("\"“”").strip()
 def _split_name_role(s: str) -> Tuple[str, Optional[str]]:
-    s = s.strip().strip(",").strip()
+    s = re.sub(r"\s+", " ", s).strip().strip(",").strip()
     role: Optional[str] = None
     m = _ROLE_PAREN_RE.search(s)
     if m:
@@ -614,9 +661,7 @@ def _split_name_role(s: str) -> Tuple[str, Optional[str]]:
         if len(candidate) <= 40 and candidate.lower() not in ("a", "an", "the"):
             role = candidate
         s = (s[: m.start()] + s[m.end():]).strip().rstrip(",").strip()
-    s = s.strip("\"“”").strip()
-    s = re.sub(r"\s+", " ", s)
-    return s, role
+    return _clean_party_name(s), role
 def extract_parties(text: str) -> List[JSON]:
@@ -625,9 +670,6 @@ def extract_parties(text: str) -> List[JSON]:
         return []
     out: List[JSON] = []
     for raw in (m.group(1), m.group(2)):
-        # Party names can wrap across lines ("...(the \"Disclosing\nParty\")");
-        # collapse whitespace rather than truncating at the first newline.
-        raw = re.sub(r"\s+", " ", raw).strip()
         name, role = _split_name_role(raw)
         if not name or len(name) < 2 or len(name) > 120:
             continue
@@ -706,9 +748,45 @@ def extract_defined_terms(text: str) -> List[JSON]:
     return [{"term": t, "confidence": 0.6, "source": "deterministic"} for t in seen]
+# Detected-heading titles that are almost never real clauses: front/back-matter,
+# page/document codes, exhibit & schedule references.
+_NOISE_TITLE_PREFIX_RE = re.compile(
+    r"^(?:table\s+of\s+contents|exhibit|schedule|annex|appendix|attachment|"
+    r"signature\s+page|page)\b",
+    re.IGNORECASE,
+)
+def _is_noise_clause_title(title: str) -> bool:
+    """True for detected 'headings' that are structural noise rather than
+    clauses -- document codes/page numbers (4+ consecutive digits, e.g.
+    'Ks 112708-2'), and front/back-matter like 'Table of Contents' or
+    'Exhibit B'. Safe filters only; kept conservative to avoid dropping real
+    clauses."""
+    t = title.strip()
+    if re.search(r"\d{4,}", t):
+        return True
+    if _NOISE_TITLE_PREFIX_RE.match(t):
+        return True
+    return False
 def extract_clauses(text: str) -> List[JSON]:
+    detected = detect_clauses(text)
+    # A heading whose title repeats 3+ times across the document is almost
+    # always a running header/footer (e.g. a page code), not that many distinct
+    # clauses -- drop every occurrence. (Counted on the normalized title.)
+    counts: Dict[str, int] = {}
+    for c in detected:
+        k = _norm_clause_key(c["title"])
+        counts[k] = counts.get(k, 0) + 1
     out: List[JSON] = []
-    for c in detect_clauses(text):
+    for c in detected:
+        if counts[_norm_clause_key(c["title"])] >= 3:
+            continue
+        if _is_noise_clause_title(c["title"]):
+            continue
         canonical, mapped = _canonicalize_clause(c["title"])
         tier = c["tier"]
         base = {"h2": 0.95, "bold-numbered": 0.85, "numbered": 0.8,
@@ -750,21 +828,91 @@ def extract_title(text: str, path: Optional[Path], fmt: str) -> Optional[str]:
 # ---------------------------------------------------------------------------
+def _looks_like_html(head: str) -> bool:
+    """Heuristic: does this text look like HTML? Catches HTML masquerading as
+    .txt (e.g. SEC EDGAR full submissions wrap HTML exhibits in a .txt)."""
+    low = head.lower()
+    if "<!doctype html" in low or "<html" in low or "<body" in low:
+        return True
+    return len(re.findall(r"</?(?:p|div|table|tr|td|span|br|h[1-6]|font|b|i)\b", low)) >= 6
 def _detect_format(path: Path, raw: bytes) -> str:
     ext = path.suffix.lower()
-    if ext in (".md", ".markdown"):
-        return "markdown"
-    if ext == ".txt":
-        return "text"
+    if ext in (".htm", ".html", ".xhtml"):
+        return "html"
     if ext == ".docx":
         return "docx"
     if ext == ".pdf":
         return "pdf"
     if raw[:4] == b"%PDF":
         return "pdf"
-    if raw[:2] == b"PK":
+    if raw[:2] == b"PK" and ext not in (".md", ".markdown", ".txt"):
         return "docx"
-    return "text"
+    base = "markdown" if ext in (".md", ".markdown") else "text"
+    # Content sniff: HTML hiding inside a .txt/.md (or extensionless) file.
+    if _looks_like_html(raw[:4096].decode("utf-8", "replace")):
+        return "html"
+    return base
+class _HTMLTextExtractor(html.parser.HTMLParser):
+    """Stdlib HTML -> text: drops script/style, frames block elements with blank
+    lines (so clause-heading detection still works), and unescapes entities."""
+    _SKIP = {"script", "style", "head", "title", "meta", "link", "noscript"}
+    _BLOCK = {
+        "p", "div", "br", "tr", "li", "h1", "h2", "h3", "h4", "h5", "h6",
+        "section", "article", "table", "ul", "ol", "blockquote", "pre", "hr",
+        "thead", "tbody", "header", "footer", "main",
+    }
+    def __init__(self) -> None:
+        super().__init__(convert_charrefs=True)
+        self._parts: List[str] = []
+        self._skip = 0
+    def handle_starttag(self, tag: str, attrs: Any) -> None:
+        if tag in self._SKIP:
+            self._skip += 1
+        elif tag in self._BLOCK:
+            self._parts.append("\n")
+    def handle_endtag(self, tag: str) -> None:
+        if tag in self._SKIP and self._skip > 0:
+            self._skip -= 1
+        elif tag in self._BLOCK:
+            self._parts.append("\n")
+    def handle_data(self, data: str) -> None:
+        if self._skip == 0:
+            self._parts.append(data)
+    def get_text(self) -> str:
+        # Strip each line; collapse runs of blank lines to a single blank line
+        # (gives ALL-CAPS / numbered headings their blank-line frame).
+        lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in "".join(self._parts).split("\n")]
+        out: List[str] = []
+        blank = False
+        for ln in lines:
+            if ln:
+                out.append(ln)
+                blank = False
+            elif not blank:
+                out.append("")
+                blank = True
+        return "\n".join(out).strip()
+def _read_html(raw_text: str) -> str:
+    parser = _HTMLTextExtractor()
+    try:
+        parser.feed(raw_text)
+        parser.close()
+    except Exception:
+        # Never crash on malformed markup; fall back to a crude tag strip.
+        return re.sub(r"<[^>]+>", " ", raw_text)
+    return parser.get_text()
 def _read_docx(path: Path, raw: bytes, prefer_optional: bool = True) -> Tuple[str, List[str]]:
@@ -986,6 +1134,8 @@ def load_source(path: Path, prefer_optional: bool = True) -> Tuple[bytes, str, s
     warnings: List[str] = []
     if fmt in ("markdown", "text"):
         text = raw.decode("utf-8", "replace")
+    elif fmt == "html":
+        text = _read_html(raw.decode("utf-8", "replace"))
     elif fmt == "docx":
         text, w = _read_docx(path, raw, prefer_optional)
         warnings += w
@@ -1011,6 +1161,13 @@ def build_extraction(text: str, raw: bytes, fmt: str,
                      source_path: Optional[str]) -> JSON:
     """Run the deterministic tier and assemble the output contract object."""
     sha = hashlib.sha256(raw).hexdigest()
+    # Field extractors (parties, dates, governing law, term, value, defined
+    # terms) run on a whitespace-flattened copy so values that wrap across a
+    # line break in the source -- "...laws of the Province\nof Ontario", a party
+    # name split mid-line -- are matched whole. Clause detection and the title
+    # keep the original text, which depends on line structure.
+    flat = re.sub(r"[ \t\r\f\v]*\n[ \t\r\f\v]*", " ", text)
+    flat = re.sub(r"[ \t]+", " ", flat)
     return {
         "document": {
             "title": extract_title(text, Path(source_path) if source_path else None, fmt),
@@ -1018,13 +1175,13 @@ def build_extraction(text: str, raw: bytes, fmt: str,
             "sha256": sha,
             "source_path": source_path,
         },
-        "parties": extract_parties(text),
-        "dates": extract_dates(text),
-        "term": extract_term(text),
-        "governing_law": extract_governing_law(text),
+        "parties": extract_parties(flat),
+        "dates": extract_dates(flat),
+        "term": extract_term(flat),
+        "governing_law": extract_governing_law(flat),
         "clauses": extract_clauses(text),
-        "defined_terms": extract_defined_terms(text),
-        "value": extract_value(text),
+        "defined_terms": extract_defined_terms(flat),
+        "value": extract_value(flat),
         "_meta": {
             "extractor_version": EXTRACTOR_VERSION,
             "tiers_used": ["deterministic"],
@@ -1336,7 +1493,7 @@ def output_schema() -> JSON:
                 "required": ["title", "format", "sha256", "source_path"],
                 "properties": {
                     "title": {"type": ["string", "null"]},
-                    "format": {"enum": ["markdown", "text", "docx", "pdf"]},
+                    "format": {"enum": ["markdown", "text", "docx", "pdf", "html"]},
                     "sha256": {"type": "string", "pattern": "^[0-9a-f]{64}$"},
                     "source_path": {"type": ["string", "null"]},
                 },
@@ -1687,7 +1844,7 @@ def _add_common_output_flags(p: argparse.ArgumentParser) -> None:
 def build_parser() -> argparse.ArgumentParser:
     parser = argparse.ArgumentParser(
         prog="extract",
-        description="Ingest any contract (.md/.txt/.docx/.pdf) and emit structured "
+        description="Ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured "
                     "JSON for the contract-ops CLI suite. See docs/INTEROP.md.",
     )
     parser.add_argument("-V", "--version", action="version",
@@ -1721,7 +1878,7 @@ def build_parser() -> argparse.ArgumentParser:
 def _build_extract_args(p: argparse.ArgumentParser) -> None:
-    p.add_argument("path", help="Path to the document (.md/.txt/.docx/.pdf).")
+    p.add_argument("path", help="Path to the document (.md/.txt/.html/.docx/.pdf).")
     p.add_argument("--llm", action="store_true",
                    help="Opt-in LLM enrichment of fuzzy fields (renewal, obligations). "
                         "Off by default; the deterministic core is fully useful without it.")

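The whitespace flattening added to `build_extraction` can be tried in isolation. The sketch below reuses the two substitutions from the hunk above; the `flatten_whitespace` wrapper and the governing-law `pattern` are hypothetical and exist only to show why a single-line pattern misses a wrapped value until the text is flattened.

```python
import re


def flatten_whitespace(text):
    # Same two substitutions as in the diff: join lines, then squeeze spaces.
    flat = re.sub(r"[ \t\r\f\v]*\n[ \t\r\f\v]*", " ", text)
    return re.sub(r"[ \t]+", " ", flat)


# Hypothetical pattern; the package's real governing-law regex is not shown here.
pattern = re.compile(r"laws of the (Province of \w+)")

wrapped = "governed by the laws of the Province\n  of Ontario."

print(pattern.search(wrapped))                               # None
print(pattern.search(flatten_whitespace(wrapped)).group(1))  # Province of Ontario
```

Clause detection would not want this treatment, since it relies on the line structure that flattening removes; that is why the diff passes the original `text` to `extract_clauses`.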
{extract_cli-0.1.1 → extract_cli-0.1.3}/pyproject.toml RENAMED

@@ -4,8 +4,8 @@ build-backend = "hatchling.build"
 [project]
 name = "extract-cli"
-version = "0.1.1"
-description = "Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.docx/.pdf) and emit structured JSON."
+version = "0.1.3"
+description = "Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON."
 readme = "README.md"
 requires-python = ">=3.9"
 license = { text = "MIT" }

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/_make_goldens.py RENAMED

@@ -20,7 +20,8 @@ from tests._fixtures_build import ensure_binary_fixtures  # noqa: E402
 FIXTURES = Path(__file__).resolve().parent / "fixtures"
 DOCS = ["nda_h2.md", "services_bold.txt", "lease_allcaps.txt",
-        "employment_docx.docx", "license_pdf.pdf", "scanned.pdf"]
+        "employment_docx.docx", "license_pdf.pdf", "services_html.html",
+        "scanned.pdf"]
 def golden_for(name: str) -> dict:

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/conftest.py RENAMED

@@ -26,6 +26,7 @@ CORPUS: Tuple[Tuple[str, str, str], ...] = (
     ("lease_allcaps.txt", "all-caps", "text"),
     ("employment_docx.docx", "bold-numbered", "docx"),
     ("license_pdf.pdf", "all-caps", "pdf"),
+    ("services_html.html", "numbered", "html"),
 )

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/fixtures/employment_docx.docx.expected.json RENAMED

@@ -138,7 +138,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.1",
+    "extractor_version": "0.1.3",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/fixtures/lease_allcaps.txt.expected.json RENAMED

@@ -133,7 +133,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.1",
+    "extractor_version": "0.1.3",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/fixtures/license_pdf.pdf.expected.json RENAMED

@@ -133,7 +133,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.1",
+    "extractor_version": "0.1.3",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/fixtures/nda_h2.md.expected.json RENAMED

@@ -121,6 +121,11 @@
       "confidence": 0.6,
       "source": "deterministic"
     },
+    {
+      "term": "Disclosing Party",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
     {
       "term": "Receiving Party",
       "confidence": 0.6,
@@ -138,7 +143,7 @@
     "source": "none"
   },
   "_meta": {
-    "extractor_version": "0.1.1",
+    "extractor_version": "0.1.3",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/fixtures/scanned.pdf.expected.json RENAMED

@@ -48,7 +48,7 @@
     "source": "none"
   },
   "_meta": {
-    "extractor_version": "0.1.1",
+    "extractor_version": "0.1.3",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/fixtures/services_bold.txt.expected.json RENAMED

@@ -133,7 +133,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.1",
+    "extractor_version": "0.1.3",
     "tiers_used": [
       "deterministic"
     ],

extract_cli-0.1.3/tests/fixtures/services_html.html ADDED

@@ -0,0 +1,35 @@
+<!DOCTYPE html>
+<html>
+<head>
+  <title>Exhibit 10.1</title>
+  <style>body { font-family: serif; } .hidden { display:none; }</style>
+  <script>var x = "(this should never appear in output)";</script>
+</head>
+<body>
+  <p align="center"><b>MASTER SERVICES AGREEMENT</b></p>
+  <p>This Master Services Agreement (the &ldquo;Agreement&rdquo;) is entered
+  into as of March 15, 2023 (the &quot;Effective Date&quot;), by and between
+  Initrode&nbsp;Systems,&nbsp;Inc., a Delaware corporation (&ldquo;Provider&rdquo;),
+  and Hooli&nbsp;LLC (&ldquo;Customer&rdquo;).</p>
+  <p>1. Services</p>
+  <p>Provider shall perform the services described in each Statement of Work.</p>
+  <p>2. Fees and Payment</p>
+  <p>Customer shall pay Provider the fees set forth in the applicable Statement
+  of Work, not to exceed $500,000 in the aggregate.</p>
+  <p>3. Term and Termination</p>
+  <p>The initial term of this Agreement is two (2) years. Either party may
+  terminate upon sixty (60) days&rsquo; written notice. This Agreement shall
+  automatically renew for successive one-year terms.</p>
+  <p>4. Confidentiality</p>
+  <p>Each party shall protect the other&rsquo;s &ldquo;Confidential
+  Information&rdquo; using reasonable care.</p>
+  <p>5. Governing Law</p>
+  <p>This Agreement shall be governed by the laws of the State of California.</p>
+</body>
+</html>
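
For readers who want to see how a fixture like this turns into plain text, here is a minimal sketch of the stdlib `html.parser` approach. It is a simplified illustration, not the package's `_HTMLTextExtractor`: the class name `TextOnly`, the function `html_to_text`, and the reduced tag sets are assumptions, and unlike the real reader it drops blank lines entirely instead of keeping a single blank line around block elements.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical, reduced HTML-to-text reader built on the stdlib parser."""

    SKIP = {"script", "style"}          # content inside these is discarded
    BLOCK = {"p", "div", "br", "h1", "h2"}  # these force a line break

    def __init__(self):
        # convert_charrefs=True makes the parser unescape entities for us.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    lines = [ln.strip() for ln in "".join(parser.parts).split("\n")]
    return "\n".join(ln for ln in lines if ln)


sample = "<script>var x = 1;</script><p>1. Services</p><p>Provider &amp; Customer agree.</p>"
print(html_to_text(sample))
```

On this sample the script body is skipped, each `<p>` lands on its own line, and `&amp;` comes out as `&`.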

extract_cli-0.1.3/tests/fixtures/services_html.html.expected.json ADDED

@@ -0,0 +1,157 @@
+{
+  "document": {
+    "title": "MASTER SERVICES AGREEMENT",
+    "format": "html",
+    "sha256": "088b40f13135e6b5d8f8548b162d657f10725d348388c7c3a416d11d7fc65300",
+    "source_path": "services_html.html"
+  },
+  "parties": [
+    {
+      "name": "Initrode Systems, Inc.",
+      "confidence": 0.9,
+      "source": "deterministic",
+      "role": "Provider"
+    },
+    {
+      "name": "Hooli LLC",
+      "confidence": 0.9,
+      "source": "deterministic",
+      "role": "Customer"
+    }
+  ],
+  "dates": {
+    "effective": {
+      "value": "2023-03-15",
+      "confidence": 0.9,
+      "source": "deterministic"
+    },
+    "expiration": {
+      "value": null,
+      "confidence": 0.0,
+      "source": "none"
+    }
+  },
+  "term": {
+    "length": {
+      "value": "2 years",
+      "confidence": 0.7,
+      "source": "deterministic"
+    },
+    "auto_renew": {
+      "value": true,
+      "confidence": 0.65,
+      "source": "deterministic"
+    },
+    "notice_period_days": {
+      "value": 60,
+      "confidence": 0.7,
+      "source": "deterministic"
+    }
+  },
+  "governing_law": {
+    "value": "State of California",
+    "confidence": 0.85,
+    "source": "deterministic"
+  },
+  "clauses": [
+    {
+      "canonical_title": "Services",
+      "detected_title": "1. Services",
+      "tier": "numbered",
+      "span": {
+        "start": 242,
+        "end": 329
+      },
+      "confidence": 0.6,
+      "source": "deterministic",
+      "mapped": false
+    },
+    {
+      "canonical_title": "Payment",
+      "detected_title": "2. Fees and Payment",
+      "tier": "numbered",
+      "span": {
+        "start": 329,
+        "end": 476
+      },
+      "confidence": 0.8,
+      "source": "deterministic",
+      "mapped": true
+    },
+    {
+      "canonical_title": "Termination",
+      "detected_title": "3. Term and Termination",
+      "tier": "numbered",
+      "span": {
+        "start": 476,
+        "end": 692
+      },
+      "confidence": 0.8,
+      "source": "deterministic",
+      "mapped": true
+    },
+    {
+      "canonical_title": "Confidentiality",
+      "detected_title": "4. Confidentiality",
+      "tier": "numbered",
+      "span": {
+        "start": 692,
+        "end": 800
+      },
+      "confidence": 0.8,
+      "source": "deterministic",
+      "mapped": true
+    },
+    {
+      "canonical_title": "Governing Law",
+      "detected_title": "5. Governing Law",
+      "tier": "numbered",
+      "span": {
+        "start": 800,
+        "end": 890
+      },
+      "confidence": 0.8,
+      "source": "deterministic",
+      "mapped": true
+    }
+  ],
+  "defined_terms": [
+    {
+      "term": "Agreement",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
+    {
+      "term": "Effective Date",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
+    {
+      "term": "Provider",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
+    {
+      "term": "Customer",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
+    {
+      "term": "Confidential Information",
+      "confidence": 0.6,
+      "source": "deterministic"
+    }
+  ],
+  "value": {
+    "value": "$500,000",
+    "confidence": 0.6,
+    "source": "deterministic"
+  },
+  "_meta": {
+    "extractor_version": "0.1.3",
+    "tiers_used": [
+      "deterministic"
+    ],
+    "llm_used": false
+  }
+}

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/test_clause_map.py RENAMED

@@ -69,6 +69,31 @@ def test_trailing_period_stripped_from_titles() -> None:
     assert ex._canonicalize_clause("Survival.") == ("Survival", True)
+def test_repeated_heading_treated_as_boilerplate() -> None:
+    # A "heading" that repeats 3+ times is a running header/footer, not clauses.
+    body = "\n\n".join("## Ks 99-2\n\nfoo" for _ in range(4))
+    text = "## Confidentiality\n\nreal body\n\n" + body
+    clauses = ex.extract_clauses(text)
+    titles = [c["canonical_title"] for c in clauses]
+    assert "Confidentiality" in titles
+    assert not any("Ks" in (t or "") for t in titles)
+def test_noise_clause_titles_filtered() -> None:
+    assert ex._is_noise_clause_title("Ks 112708-2")       # 4+ digit code
+    assert ex._is_noise_clause_title("Table of Contents")
+    assert ex._is_noise_clause_title("Exhibit B")
+    assert ex._is_noise_clause_title("Schedule 2.1")
+    assert not ex._is_noise_clause_title("Confidentiality")
+    assert not ex._is_noise_clause_title("Term and Termination")
+def test_party_cuts_together_as_agent_and_unclosed_paren() -> None:
+    assert ex._clean_party_name("Foo LLC, together with its affiliates") == "Foo LLC"
+    assert ex._clean_party_name("GE Capital Corporation, as administrative agent") == "GE Capital Corporation"
+    assert ex._clean_party_name("Glenn Rufrano (each of them being") == "Glenn Rufrano"
 def test_cascade_priority_h2_wins() -> None:
     # An H2 present means the bold/all-caps fallbacks must not fire.
     text = "## Real Heading\n\n**1. Not A Heading**\n\nALSO NOT A HEADING\n\nbody"
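
The new clause-map tests in this hunk pin down two filters: a heading that repeats three or more times is treated as a running header/footer, and titles such as document codes, "Table of Contents", "Exhibit B", or "Schedule 2.1" are treated as structural noise. A minimal sketch of how such filters could look is given below. The names `is_noise_title` and `drop_repeated_headings` and the regex patterns are hypothetical, inferred only from the test cases; they are not the package's `_is_noise_clause_title` or `extract_clauses` implementation, which this diff does not show.

```python
import re
from collections import Counter

# Hypothetical patterns inferred from the test cases only; the real rules may differ.
_NOISE_PATTERNS = [
    re.compile(r"\d{4,}"),                                        # 4+ digit code, e.g. "Ks 112708-2"
    re.compile(r"^table of contents$", re.IGNORECASE),
    re.compile(r"^(exhibit|schedule)\s+[\w.]+$", re.IGNORECASE),  # "Exhibit B", "Schedule 2.1"
]


def is_noise_title(title: str) -> bool:
    """Return True when a detected heading looks like structural noise."""
    stripped = title.strip()
    return any(p.search(stripped) for p in _NOISE_PATTERNS)


def drop_repeated_headings(titles: list[str], min_repeats: int = 3) -> list[str]:
    """Drop headings that occur min_repeats or more times (running headers/footers)."""
    counts = Counter(t.strip().lower() for t in titles)
    return [t for t in titles if counts[t.strip().lower()] < min_repeats]


if __name__ == "__main__":
    headings = ["Confidentiality"] + ["Ks 99-2"] * 4
    print(drop_repeated_headings(headings))   # only the non-repeated heading survives
    print(is_noise_title("Schedule 2.1"))
```

Note that "Ks 99-2" has no 4+ digit run, so in this sketch it is removed by the repetition check rather than by the title patterns, which mirrors why the two tests are separate.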

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/test_deterministic.py RENAMED

@@ -12,8 +12,8 @@ def test_parties_between_simple() -> None:
     assert all(0.0 <= p["confidence"] <= 1.0 for p in parties)
-def test_parties_with_roles_and_linebreak() -> None:
-    text = ('by and between Acme Corp. (the "Disclosing\nParty") and '
+def test_parties_with_roles() -> None:
+    text = ('by and between Acme Corp. (the "Disclosing Party") and '
             'Beta LLC (the "Receiving Party"), dated March 1, 2024.')
     parties = ex.extract_parties(text)
     assert parties[0]["name"] == "Acme Corp."
@@ -22,6 +22,30 @@ def test_parties_with_roles_and_linebreak() -> None:
     assert parties[1]["role"] == "Receiving Party"
+def test_parties_linebreak_handled_by_build() -> None:
+    # build_extraction flattens whitespace, so a party/role that wraps across a
+    # line is matched whole.
+    text = ('This Agreement is made by and between Acme Corp. (the "Disclosing\n'
+            'Party") and Beta LLC (the "Receiving Party").')
+    r = ex.build_extraction(text, text.encode("utf-8"), "text", "x.txt")
+    assert [p["name"] for p in r["parties"]] == ["Acme Corp.", "Beta LLC"]
+    assert r["parties"][0]["role"] == "Disclosing Party"
+def test_parties_skip_and_inside_description() -> None:
+    # An "and" inside a party's own description must not split the parties.
+    text = ("between Blade Ventures Inc., a Nevada corporation having offices at "
+            "1 Main St and doing business as Foo (\"Client\"), and KPMG LP")
+    parties = ex.extract_parties(text)
+    assert [p["name"] for p in parties] == ["Blade Ventures Inc.", "KPMG LP"]
+def test_party_name_descriptors_trimmed() -> None:
+    assert ex._clean_party_name("Visteon Corporation, a Delaware corporation") == "Visteon Corporation"
+    assert ex._clean_party_name("Foo Inc. doing business as Bar") == "Foo Inc."
+    assert ex._clean_party_name("Baz LLC having its principal office at X") == "Baz LLC"
 def test_parties_none() -> None:
     assert ex.extract_parties("There are no parties named here.") == []
@@ -80,6 +104,15 @@ def test_governing_law_stops_before_trailing_clause() -> None:
     assert out["value"] == "State of Delaware"
+def test_governing_law_linebreak_handled_by_build() -> None:
+    # A jurisdiction that wraps a line ("...the Province\nof Ontario") is
+    # matched whole because build_extraction flattens whitespace first.
+    text = ("This Agreement shall be governed by the laws of the Province\n"
+            "of Ontario and the federal laws of Canada.")
+    r = ex.build_extraction(text, text.encode("utf-8"), "text", "x.txt")
+    assert r["governing_law"]["value"] == "Province of Ontario"
 def test_governing_law_missing() -> None:
     assert ex.extract_governing_law("nothing about law")["source"] == "none"
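
The two `*_linebreak_handled_by_build` tests above depend on `build_extraction` running its field extractors on a whitespace-flattened copy of the text, so a value that wraps across a line is matched whole. The sketch below illustrates that idea in isolation. `flatten_whitespace` and the `_LAW` regex are hypothetical stand-ins written for this example; they are not the package's code, and the real governing-law pattern is not shown in this diff.

```python
import re


def flatten_whitespace(text: str) -> str:
    """Collapse every run of whitespace (newlines included) into a single space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical pattern that expects single spaces between words.
_LAW = re.compile(r"governed by the laws of the ((?:State|Province) of [A-Z][a-z]+)")

wrapped = ("This Agreement shall be governed by the laws of the Province\n"
           "of Ontario and the federal laws of Canada.")

# On the raw text the newline between "Province" and "of" defeats the pattern.
raw_match = _LAW.search(wrapped)

# On the flattened copy the jurisdiction is matched whole.
flat_match = _LAW.search(flatten_whitespace(wrapped))

if __name__ == "__main__":
    print(raw_match)
    print(flat_match.group(1) if flat_match else None)
```

Flattening once up front keeps each extractor's pattern simple; the alternative is to write `\s+` between every pair of words in every pattern. Character offsets in the flattened copy no longer line up with the source, which is presumably why clause detection, whose output carries `span` offsets, keeps the original text.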

{extract_cli-0.1.1 → extract_cli-0.1.3}/tests/test_misc.py RENAMED

@@ -142,6 +142,32 @@ def test_pdf_unescape() -> None:
     assert ex._pdf_unescape(r"\101\102") == "AB"  # octal escapes
+def test_html_extraction() -> None:
+    raw, text, fmt, _w = ex.load_source(FIXTURES / "services_html.html")
+    assert fmt == "html"
+    # script/style content is dropped; entities are unescaped.
+    assert "this should never appear" not in text
+    result = ex.build_extraction(text, raw, fmt, "services_html.html")
+    assert result["document"]["format"] == "html"
+    assert [p["name"] for p in result["parties"]] == ["Initrode Systems, Inc.", "Hooli LLC"]
+    assert result["governing_law"]["value"] == "State of California"
+    assert result["dates"]["effective"]["value"] == "2023-03-15"
+    canon = {c["canonical_title"] for c in result["clauses"]}
+    assert {"Payment", "Termination", "Confidentiality", "Governing Law"} <= canon
+def test_html_detected_by_content_sniff(tmp_path: Any) -> None:
+    # HTML masquerading as .txt (e.g. a SEC EDGAR full submission) is sniffed.
+    p = tmp_path / "exhibit.txt"
+    p.write_text("<html><body><p>between A Co and B Co</p></body></html>")
+    _raw, _text, fmt, _w = ex.load_source(p)
+    assert fmt == "html"
+def test_html_malformed_does_not_crash() -> None:
+    assert ex._read_html("<p>unclosed <b>bold <div>text") is not None
 def test_pdf_text_only_inside_bt_et() -> None:
     # Strings outside BT/ET (font/signature/metadata stream bytes that happen to
     # contain parentheses) must be ignored; only text objects yield text.