PyPI - extract-cli - Versions diffs - 0.1.0__tar.gz → 0.1.2__tar.gz - Mend

extract-cli 0.1.0 (tar.gz) → 0.1.2 (tar.gz)


Files changed (40)

{extract_cli-0.1.0 → extract_cli-0.1.2}/ARCHITECTURE.md RENAMED

@@ -8,11 +8,14 @@ map.
 ```
 load_source(path)                       extension/content sniff → reader
   ├─ .md/.txt  → utf-8 decode
+  ├─ .html     → stdlib html.parser reader (also auto-detected inside .txt)
   ├─ .docx     → python-docx (if [docx]) else stdlib zipfile/XML reader
   └─ .pdf      → pypdf (if [pdf]) else stdlib zlib + text-operator reader
         │
         ▼  (raw_bytes, text, format, warnings)
 build_extraction(text, raw, fmt, src)   the DETERMINISTIC tier (always on)
+  │  field extractors run on a whitespace-FLATTENED copy (so values that wrap
+  │  across a line are matched whole); clause detection keeps the original text
   ├─ extract_parties          "between X and Y", with role parentheticals
   ├─ extract_dates            effective / expiration, ISO-normalized
   ├─ extract_term             length / auto_renew / notice_period_days

{extract_cli-0.1.0 → extract_cli-0.1.2}/CHANGELOG.md RENAMED

@@ -6,6 +6,64 @@ to [Semantic Versioning](https://semver.org/). Per the suite convention
 (see [`docs/INTEROP.md`](docs/INTEROP.md)), **backward-incompatible changes to
 the output schema require a major version bump**; new optional fields are minor.
+## [0.1.2] - 2026-05-21
+More real-world hardening, driven by testing against five additional contracts
+(SEC EDGAR consulting/MSA, lease, and Visteon services agreements; Common Paper
+and Perigon Cloud Service Agreements).
+### Added
+- **HTML input** (`.html`/`.htm`, and HTML auto-detected inside `.txt` such as
+  SEC EDGAR full submissions). Stdlib `html.parser`-based reader strips
+  script/style, frames block elements so heading detection still works, and
+  unescapes entities. `document.format` enum gains `html` (backward-compatible
+  widening). This turns the large class of HTML contracts (SEC exhibits, web
+  ToS) from garbage into structured output.
+### Fixed
+- **Field extraction now runs on whitespace-flattened text**, so values that
+  wrap across a line break are matched whole — e.g. governing law
+  `the laws of the Province\nof Ontario` now yields `Province of Ontario`, and
+  line-wrapped party names/defined terms are captured.
+- **Party extraction** (continues issue #2): names are trimmed of trailing
+  descriptors (`, a Delaware corporation`, `doing business as …`,
+  `having its offices at …`, `as of …`), and each party must begin with a
+  capital so an `and` *inside* a party's own description no longer splits the
+  parties (`…V6E 3S7 and doing business as …` → real parties recovered).
+### Known limitations (documented, not bugs)
+- The stdlib PDF reader cannot decode PDFs that use embedded subset fonts with
+  hex-encoded glyph strings (common in professionally-typeset PDFs); these
+  degrade gracefully to a low-signal warning. Install the `[pdf]` extra (pypdf)
+  for them — verified to recover full text and clause structure.
+- Two-line `ARTICLE N` / title headings (number on one line, title on the next)
+  are not yet detected.
+## [0.1.1] - 2026-05-21
+Real-world hardening, driven by testing against a SEC EDGAR employment
+agreement and the Common Paper Mutual NDA (PDF/DOCX).
+### Added
+- **`numbered` clause-detection tier** for plain numbered headings
+  (`1. Termination`, `Section 3. Payment`, `Article IV. …`) — the dominant
+  format in foreign paper, missed by the H2/bold/ALL-CAPS tiers. A title-case
+  heuristic rejects numbered sentences and list items. The output schema's
+  clause `tier` enum gains `numbered` (a backward-compatible widening).
+### Fixed
+- **PDF reader** now extracts text only from inside `BT … ET` text objects, so
+  embedded fonts, digital-signature blobs, and metadata streams no longer leak
+  binary noise (a real signed PDF dropped from ~188 KB of garbage to ~8.7 KB of
+  clean text). Added a printable-ratio backstop.
+- **Effective date**: anchor on `(the "Effective Date")` and a bare
+  `as of <date>` cue; handle dates that wrap across a line break.
+- **Term length**: require a real number, dropping false positives such as
+  `…consecutive days`.
+- **Title**: skip SGML/XML wrapper lines (e.g. SEC EDGAR `<DOCUMENT>` headers).
+- Strip trailing punctuation from clause titles (`Other Benefits.` →
+  `Other Benefits`).
 ## [0.1.0] - 2026-05-21
 Initial release — the open-loop front door of the contract-ops CLI suite.
@@ -57,4 +115,6 @@ Initial release — the open-loop front door of the contract-ops CLI suite.
   intentionally *not* governed by the output schema (the schema describes the
   full default output).
+[0.1.2]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.2
+[0.1.1]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.1
 [0.1.0]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.0

{extract_cli-0.1.0 → extract_cli-0.1.2}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: extract-cli
-Version: 0.1.0
-Summary: Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.docx/.pdf) and emit structured JSON.
+Version: 0.1.2
+Summary: Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON.
 Project-URL: Homepage, https://cli.drbaher.com/
 Project-URL: Repository, https://github.com/DrBaher/extract-cli
 Project-URL: Suite interop, https://github.com/DrBaher/extract-cli/blob/main/docs/INTEROP.md
@@ -63,8 +63,8 @@ ingest (extract) → review → diff → convert → sign
 ## What it does
-Give it a contract in **`.md` / `.txt`** (native), **`.docx`**, or **`.pdf`**,
-and it returns structured JSON: the parties, dates, term, governing law, a
+Give it a contract in **`.md` / `.txt` / `.html`** (native), **`.docx`**, or
+**`.pdf`**, and it returns structured JSON: the parties, dates, term, governing law, a
 **clause map** normalized onto the suite's canonical clause vocabulary, a
 defined-term inventory, and a headline value. Every field carries a
 `confidence` and a `source` so downstream tools **verify, don't trust**.
@@ -75,14 +75,15 @@ daemon, no network in the default path.
 ## Install
 ```bash
-pip install extract-cli                 # core: .md/.txt + best-effort .docx/.pdf
+pip install extract-cli                 # core: .md/.txt/.html + best-effort .docx/.pdf
 pip install "extract-cli[docx]"         # higher-fidelity .docx (python-docx)
 pip install "extract-cli[pdf]"          # higher-fidelity .pdf (pypdf)
 pip install "extract-cli[docx,pdf]"     # both
 ```
-The core has **zero runtime dependencies** and is fully functional on `.md`/`.txt`
-with no extras. `.docx` and `.pdf` work out of the box via stdlib readers; the
+The core has **zero runtime dependencies** and is fully functional on
+`.md`/`.txt`/`.html` with no extras (HTML is also auto-detected when it hides
+inside a `.txt`, e.g. SEC EDGAR filings). `.docx` and `.pdf` work out of the box via stdlib readers; the
 `[docx]`/`[pdf]` extras improve fidelity on complex documents (see
 [ARCHITECTURE.md](ARCHITECTURE.md)).
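
Reviewer sketch (not part of the diff): the "HTML hiding inside a `.txt`" auto-detection described above can be illustrated roughly as follows. The tag list and the 4096-byte window mirror what the `extract_cli.py` hunk in this diff shows; the names `looks_like_html` and `sniff_format` and the `min_tags` parameter are illustrative assumptions, not the package's API.

```python
import re

# Tags counted by the fallback heuristic (same list as in the diff).
_TAG_RE = re.compile(r"</?(?:p|div|table|tr|td|span|br|h[1-6]|font|b|i)\b")


def looks_like_html(head: str, min_tags: int = 6) -> bool:
    """Return True if the text has an HTML marker or enough common tags."""
    low = head.lower()
    # Strong markers: a doctype or a top-level html/body tag.
    if "<!doctype html" in low or "<html" in low or "<body" in low:
        return True
    # Weak signal: count opening and closing tags of common elements.
    return len(_TAG_RE.findall(low)) >= min_tags


def sniff_format(raw: bytes, default: str = "text") -> str:
    """Classify a file by content, looking only at its first 4096 bytes."""
    head = raw[:4096].decode("utf-8", "replace")
    return "html" if looks_like_html(head) else default
```

Sniffing only the head of the file keeps detection cheap on large filings, at the cost of missing markup that starts later than the window.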

{extract_cli-0.1.0 → extract_cli-0.1.2}/README.md RENAMED

@@ -25,8 +25,8 @@ ingest (extract) → review → diff → convert → sign
 ## What it does
-Give it a contract in **`.md` / `.txt`** (native), **`.docx`**, or **`.pdf`**,
-and it returns structured JSON: the parties, dates, term, governing law, a
+Give it a contract in **`.md` / `.txt` / `.html`** (native), **`.docx`**, or
+**`.pdf`**, and it returns structured JSON: the parties, dates, term, governing law, a
 **clause map** normalized onto the suite's canonical clause vocabulary, a
 defined-term inventory, and a headline value. Every field carries a
 `confidence` and a `source` so downstream tools **verify, don't trust**.
@@ -37,14 +37,15 @@ daemon, no network in the default path.
 ## Install
 ```bash
-pip install extract-cli                 # core: .md/.txt + best-effort .docx/.pdf
+pip install extract-cli                 # core: .md/.txt/.html + best-effort .docx/.pdf
 pip install "extract-cli[docx]"         # higher-fidelity .docx (python-docx)
 pip install "extract-cli[pdf]"          # higher-fidelity .pdf (pypdf)
 pip install "extract-cli[docx,pdf]"     # both
 ```
-The core has **zero runtime dependencies** and is fully functional on `.md`/`.txt`
-with no extras. `.docx` and `.pdf` work out of the box via stdlib readers; the
+The core has **zero runtime dependencies** and is fully functional on
+`.md`/`.txt`/`.html` with no extras (HTML is also auto-detected when it hides
+inside a `.txt`, e.g. SEC EDGAR filings). `.docx` and `.pdf` work out of the box via stdlib readers; the
 `[docx]`/`[pdf]` extras improve fidelity on complex documents (see
 [ARCHITECTURE.md](ARCHITECTURE.md)).
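
Reviewer sketch (not part of the diff): a minimal illustration of how a dependency-free HTML reader can be built on the standard library's `html.parser`. It is a simplified version of the approach in the `extract_cli.py` hunk: the tag sets are shortened, and blank lines are dropped rather than collapsed. The names `TextExtractor` and `html_to_text` are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text; skip script/style; break lines at block tags."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "tr", "li", "h1", "h2", "h3", "table"}

    def __init__(self) -> None:
        # convert_charrefs=True makes the parser unescape entities for us.
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    joined = "".join(parser.parts)
    # Normalize spaces within each line, then drop empty lines.
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in joined.split("\n")]
    return "\n".join(ln for ln in lines if ln)
```

Putting each block element on its own line is what lets a line-oriented heading detector still see `1. Termination` as a standalone line after the tags are gone.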

{extract_cli-0.1.0 → extract_cli-0.1.2}/docs/spec/extract-output.schema.json RENAMED

@@ -69,7 +69,8 @@
             "markdown",
             "text",
             "docx",
-            "pdf"
+            "pdf",
+            "html"
           ]
         },
         "sha256": {
@@ -183,6 +184,7 @@
             "enum": [
               "h2",
               "bold-numbered",
+              "numbered",
               "all-caps",
               "explicit",
               "llm"

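Reviewer sketch (not part of the diff): the new `numbered` tier value corresponds to the title-case heuristic in the `extract_cli.py` hunk. The version below follows that logic but makes one simplifying assumption: the line pattern accepts Arabic numerals only, whereas the pattern in the diff also accepts Roman numerals and `Article`/`Section` prefixes. The names `looks_like_heading` and `numbered_headings` are illustrative.

```python
import re
from typing import List

# Lowercase connectors allowed inside an otherwise Title-Cased heading.
STOPWORDS = {
    "a", "an", "the", "and", "or", "of", "to", "for", "in", "on", "with",
    "by", "at", "as", "per", "from", "into", "nor", "but",
}

# Simplified assumption: "1. Title" style lines with Arabic numerals only.
NUMBERED_RE = re.compile(
    r"^[ \t]*\d{1,2}\.?[ \t]+([A-Z][A-Za-z][^\n]{0,58})[ \t]*$",
    re.MULTILINE,
)


def looks_like_heading(title: str) -> bool:
    """Accept 1-9 Title-Cased words; reject sentence-like titles."""
    words = title.strip().rstrip(".").strip().split()
    if not (1 <= len(words) <= 9):
        return False
    if len(words) == 1:
        # A lone word must be capitalized and have at least 4 letters.
        return words[0][:1].isupper() and sum(ch.isalpha() for ch in words[0]) >= 4
    for w in words:
        if w[:1].isupper() or not w[:1].isalpha():
            continue  # capitalized word, or a punctuation/number token
        if w.lower() in STOPWORDS:
            continue  # allowed connector
        return False  # lowercase content word: a sentence, not a heading
    return True


def numbered_headings(text: str) -> List[str]:
    """Return the titles of numbered lines that pass the heuristic."""
    return [m.group(1) for m in NUMBERED_RE.finditer(text)
            if looks_like_heading(m.group(1))]
```

In the diff, the tier is only used when at least two such headings are found, which guards against a single numbered sentence being mistaken for document structure.
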
{extract_cli-0.1.0 → extract_cli-0.1.2}/extract_cli.py RENAMED

@@ -4,8 +4,8 @@
 The suite is a contract lifecycle (store -> draft -> review -> diff -> convert
 -> sign) that, until now, only handled documents it authored from its own
 templates. `extract-cli` is "passport control": it ingests ANY document --
-yours or a counterparty's foreign paper -- in .md/.txt (natively), .docx, or
-.pdf, and emits a structured JSON representation that the rest of the suite
+yours or a counterparty's foreign paper -- in .md/.txt/.html (natively), .docx,
+or .pdf, and emits a structured JSON representation that the rest of the suite
 (nda-review-cli, compare-cli, contract-vault) consumes.
 Two extraction tiers:
@@ -32,6 +32,7 @@ from __future__ import annotations
 import argparse
 import datetime as _dt
 import hashlib
+import html.parser
 import importlib.util
 import json
 import os
@@ -42,11 +43,11 @@ import urllib.request
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
-__version__ = "0.1.0"
+__version__ = "0.1.2"
 # Bumped independently of the package version when the *extraction logic*
 # changes in a way downstream consumers should notice. Embedded in `_meta`.
-EXTRACTOR_VERSION = "0.1.0"
+EXTRACTOR_VERSION = "0.1.2"
 # JSON Schema version of the output contract (docs/spec/extract-output.schema.json).
 SCHEMA_VERSION = 1
@@ -214,6 +215,49 @@ def _qualifies_as_all_caps_heading(title: str) -> bool:
     return sum(1 for ch in title if "A" <= ch <= "Z") >= 4
+# Tier between bold-numbered and ALL-CAPS: plain numbered headings on their own
+# line -- "1. Termination", "5. Wage Compensation", "Section 3. Payment",
+# "Article IV. Confidentiality". These are the dominant real-world format in
+# foreign paper (and aren't caught by H2, **bold**, or ALL-CAPS). A title-case
+# heuristic distinguishes a heading from a numbered *sentence* or list item.
+_NUMBERED_HEADING_RE = re.compile(
+    r"^[ \t]*"
+    r"(?:(?:Article|Section|ARTICLE|SECTION)[ \t]+)?"
+    r"(?:" + _ROMAN_RE + r"|\d{1,2})\.?"
+    r"[ \t]+"
+    r"([A-Z][A-Za-z][^\n]{0,58})"
+    r"[ \t]*$",
+    re.MULTILINE,
+)
+# Lowercase words allowed inside an otherwise Title-Cased heading.
+_HEADING_STOPWORDS = {
+    "a", "an", "the", "and", "or", "of", "to", "for", "in", "on", "with",
+    "by", "at", "as", "per", "from", "into", "nor", "but",
+}
+def _qualifies_as_numbered_heading(title: str) -> bool:
+    """A numbered line qualifies as a heading only if its title looks like a
+    heading: 1-9 words, Title-Cased (every word starts uppercase or is a short
+    lowercase connector), no sentence-y lowercase verbs. A single word must be
+    >= 4 letters. Rejects 'The parties agree as follows' but accepts 'Wage
+    Compensation' and 'Term And Nature Of Employment'."""
+    t = title.strip().rstrip(".").strip()
+    words = t.split()
+    if not (1 <= len(words) <= 9):
+        return False
+    if len(words) == 1:
+        return sum(1 for ch in words[0] if ch.isalpha()) >= 4 and words[0][:1].isupper()
+    for w in words:
+        if w[:1].isupper() or not w[:1].isalpha():
+            continue  # capitalized word, or punctuation/number token
+        if w.lower() in _HEADING_STOPWORDS:
+            continue  # allowed connector
+        return False  # a lowercase content word => this is a sentence, not a heading
+    return True
 def detect_clauses(text: str) -> List[JSON]:
     """Run the three-tier cascade and return clauses with their detection tier.
@@ -227,6 +271,12 @@ def detect_clauses(text: str) -> List[JSON]:
     bold = list(_BOLD_HEADING_RE.finditer(text))
     if len(bold) >= 2:
         return _matches_to_clauses(text, bold, group=1, tier="bold-numbered")
+    numbered = [
+        m for m in _NUMBERED_HEADING_RE.finditer(text)
+        if _qualifies_as_numbered_heading(m.group(1))
+    ]
+    if len(numbered) >= 2:
+        return _matches_to_clauses(text, numbered, group=1, tier="numbered")
     caps = [
         m for m in _ALL_CAPS_HEADING_RE.finditer(text)
         if _qualifies_as_all_caps_heading(m.group(1))
@@ -266,8 +316,9 @@ def _matches_to_clauses(text: str, matches: List["re.Match[str]"], group: int,
 def _norm_clause_key(s: str) -> str:
-    """Normalize a clause title/alias for matching (number-stripped, lowercased)."""
-    return _strip_clause_number(s).strip().lower()
+    """Normalize a clause title/alias for matching (number-stripped, trailing
+    punctuation removed, lowercased)."""
+    return _strip_clause_number(s).strip().lower().rstrip(" .:;,")
 # ---------------------------------------------------------------------------
@@ -366,7 +417,7 @@ def _canonicalize_clause(detected_title: str) -> Tuple[Optional[str], bool]:
                 best, best_len = canonical, len(alias_key)
     if best is not None:
         return best, True
-    return _titlecase(detected_title), False
+    return _titlecase(detected_title.strip().rstrip(" .:;,")), False
 # ---------------------------------------------------------------------------
@@ -421,11 +472,17 @@ _DATE_PAT = (
 )
 _DATE_RE = re.compile(_DATE_PAT, re.IGNORECASE)
+# Highest-confidence: a date explicitly labeled "(the "Effective Date")".
+_EFFDATE_LABEL_RE = re.compile(
+    r"(" + _DATE_PAT + r")\s*\(\s*(?:the\s+)?[\"“]?\s*Effective\s+Date",
+    re.IGNORECASE,
+)
 _EFFECTIVE_RE = re.compile(
     r"(?:effective(?:\s+date)?(?:\s+(?:as\s+of|date|on))?|"
     r"dated(?:\s+as\s+of)?|"
     r"made(?:\s+and\s+entered\s+into)?(?:\s+as\s+of|\s+on)?|"
-    r"entered\s+into(?:\s+as\s+of|\s+on)?)"
+    r"entered\s+into(?:\s+as\s+of|\s+on)?|"
+    r"as\s+of)"
     r"[\s:,]+(?:the\s+)?(" + _DATE_PAT + r")",
     re.IGNORECASE,
 )
@@ -436,10 +493,17 @@ _EXPIRE_RE = re.compile(
     re.IGNORECASE,
 )
+# Each party must start with a capital letter (optionally "the X"), a quote, or
+# a paren. This is case-sensitive on purpose (no global IGNORECASE -- only the
+# keywords are): it lets the engine skip an "and" that sits INSIDE a party's own
+# description ("...V6E 3S7 and doing business as ...", where the right side
+# starts lowercase) and find the real "and" before the second named entity.
+_PARTY_START = r"(?:(?:[Tt]he|its)\s+)?[A-Z\"“(]"
 _PARTY_BLOCK_RE = re.compile(
-    r"\b(?:by\s+and\s+between|between)\s+(.{2,200}?)\s+\band\b\s+(.{2,200}?)"
-    r"(?=[\.;\n]|\bwhereas\b|\beffective\b|\bdated\b|\bhaving\b|\bwith\s+offices\b|$)",
-    re.IGNORECASE | re.DOTALL,
+    r"(?i:\b(?:by\s+and\s+between|between)\s+)"
+    r"(" + _PARTY_START + r"[^\n]{1,200}?)\s+and\s+"
+    r"(" + _PARTY_START + r"[^\n]{1,200}?)"
+    r"(?=[\.;\n]|(?i:\bwhereas\b|\beffective\b|\bdated\b|\bas\s+of\b|\bwitnesseth\b)|$)",
 )
 _ROLE_PAREN_RE = re.compile(
     r"\(\s*(?:the\s+)?[\"“]?([^\"”()]+?)[\"”]?\s*\)"
@@ -534,18 +598,54 @@ def _parse_date_to_iso(s: str) -> Optional[str]:
     return None
+def _date_field_from_str(raw: str, base_conf: float) -> JSON:
+    raw = re.sub(r"\s+", " ", raw.strip())
+    iso = _parse_date_to_iso(raw)
+    if iso is not None:
+        return _field(iso, base_conf)
+    return _field(raw, max(0.0, base_conf - 0.3))
 def _date_field(match: Optional["re.Match[str]"]) -> JSON:
     if match is None:
         return _none_field()
-    raw = match.group(1).strip()
-    iso = _parse_date_to_iso(raw)
-    if iso is not None:
-        return _field(iso, 0.85)
-    return _field(raw, 0.55)
+    return _date_field_from_str(match.group(1), 0.85)
+# Trailing descriptors that follow a party's actual name and should be dropped
+# ("Acme Corp., a Delaware corporation", "... doing business as Foo", "... as of
+# March 1", "... having its offices at ..."). Each is matched and everything from
+# it onward is cut.
+_PARTY_CUT_MARKERS: Tuple[str, ...] = (
+    r",\s+an?\s+\w",                                  # ", a Delaware ..." / ", an Ohio ..."
+    r"\s+doing\s+business\s+as\b",
+    r"\s+d/?b/?a\b",
+    r"\s+f/?k/?a\b",
+    r"\s+a[n]?\s+\w+\s+(?:corporation|company|partnership|limited)\b",
+    r"\s+having\b",
+    r"\s+with\s+(?:its\s+)?(?:offices|principal|a\s)\b",
+    r"\s+with\s+offices\b",
+    r"\s+located\b",
+    r"\s+organized\b",
+    r"\s+incorporated\b",
+    r"\s+whose\b",
+    r"\s+(?:as\s+of|dated|effective)\b",
+)
+def _clean_party_name(s: str) -> str:
+    """Trim a captured party name down to the entity name, dropping trailing
+    descriptors ('a Delaware corporation', 'd/b/a ...', 'as of ...')."""
+    s = re.sub(r"\s+", " ", s).strip().strip(",").strip()
+    for pat in _PARTY_CUT_MARKERS:
+        m = re.search(pat, s, re.IGNORECASE)
+        if m:
+            s = s[: m.start()].strip().strip(",").strip()
+    return s.strip("\"“”").strip()
 def _split_name_role(s: str) -> Tuple[str, Optional[str]]:
-    s = s.strip().strip(",").strip()
+    s = re.sub(r"\s+", " ", s).strip().strip(",").strip()
     role: Optional[str] = None
     m = _ROLE_PAREN_RE.search(s)
     if m:
@@ -554,9 +654,7 @@ def _split_name_role(s: str) -> Tuple[str, Optional[str]]:
         if len(candidate) <= 40 and candidate.lower() not in ("a", "an", "the"):
             role = candidate
         s = (s[: m.start()] + s[m.end():]).strip().rstrip(",").strip()
-    s = s.strip("\"“”").strip()
-    s = re.sub(r"\s+", " ", s)
-    return s, role
+    return _clean_party_name(s), role
 def extract_parties(text: str) -> List[JSON]:
@@ -565,9 +663,6 @@ def extract_parties(text: str) -> List[JSON]:
         return []
     out: List[JSON] = []
     for raw in (m.group(1), m.group(2)):
-        # Party names can wrap across lines ("...(the \"Disclosing\nParty\")");
-        # collapse whitespace rather than truncating at the first newline.
-        raw = re.sub(r"\s+", " ", raw).strip()
         name, role = _split_name_role(raw)
         if not name or len(name) < 2 or len(name) > 120:
             continue
@@ -578,10 +673,12 @@ def extract_parties(text: str) -> List[JSON]:
 def extract_dates(text: str) -> JSON:
-    return {
-        "effective": _date_field(_EFFECTIVE_RE.search(text)),
-        "expiration": _date_field(_EXPIRE_RE.search(text)),
-    }
+    label = _EFFDATE_LABEL_RE.search(text)
+    if label is not None:
+        effective = _date_field_from_str(label.group(1), 0.9)
+    else:
+        effective = _date_field(_EFFECTIVE_RE.search(text))
+    return {"effective": effective, "expiration": _date_field(_EXPIRE_RE.search(text))}
 def extract_governing_law(text: str) -> JSON:
@@ -600,10 +697,10 @@ def extract_term(text: str) -> JSON:
     if m:
         num = _word_to_int(m.group(1))
         unit = m.group(2).lower().rstrip("s")
+        # Only emit when the captured token is a real number; otherwise the
+        # match was a coincidence ("...consecutive days") -> leave as not-found.
         if num is not None:
             length = _field(f"{num} {unit}{'s' if num != 1 else ''}", 0.7)
-        else:
-            length = _field(f"{m.group(1)} {m.group(2)}".strip(), 0.5)
     notice = _none_field()
     nm = _NOTICE_RE.search(text)
@@ -649,7 +746,8 @@ def extract_clauses(text: str) -> List[JSON]:
     for c in detect_clauses(text):
         canonical, mapped = _canonicalize_clause(c["title"])
         tier = c["tier"]
-        base = {"h2": 0.95, "bold-numbered": 0.85, "all-caps": 0.75, "explicit": 0.95}.get(tier, 0.7)
+        base = {"h2": 0.95, "bold-numbered": 0.85, "numbered": 0.8,
+                "all-caps": 0.75, "explicit": 0.95}.get(tier, 0.7)
         conf = round(base * (1.0 if mapped else 0.75), 2)
         out.append({
             "canonical_title": canonical,
@@ -669,10 +767,14 @@ def extract_title(text: str, path: Optional[Path], fmt: str) -> Optional[str]:
         return m.group(1).strip()
     for line in text.splitlines():
         ls = line.strip().lstrip("#").strip()
-        if ls:
-            if len(ls) <= 90:
-                return ls
-            break
+        if not ls:
+            continue
+        # Skip SGML/XML wrapper lines (e.g. SEC EDGAR "<DOCUMENT>", "<TYPE>...").
+        if ls.startswith("<"):
+            continue
+        if len(ls) <= 90:
+            return ls
+        break
     if path is not None:
         return _titlecase(path.stem.replace("_", " ").replace("-", " "))
     return None
@@ -683,21 +785,91 @@ def extract_title(text: str, path: Optional[Path], fmt: str) -> Optional[str]:
 # ---------------------------------------------------------------------------
+def _looks_like_html(head: str) -> bool:
+    """Heuristic: does this text look like HTML? Catches HTML masquerading as
+    .txt (e.g. SEC EDGAR full submissions wrap HTML exhibits in a .txt)."""
+    low = head.lower()
+    if "<!doctype html" in low or "<html" in low or "<body" in low:
+        return True
+    return len(re.findall(r"</?(?:p|div|table|tr|td|span|br|h[1-6]|font|b|i)\b", low)) >= 6
 def _detect_format(path: Path, raw: bytes) -> str:
     ext = path.suffix.lower()
-    if ext in (".md", ".markdown"):
-        return "markdown"
-    if ext == ".txt":
-        return "text"
+    if ext in (".htm", ".html", ".xhtml"):
+        return "html"
     if ext == ".docx":
         return "docx"
     if ext == ".pdf":
         return "pdf"
     if raw[:4] == b"%PDF":
         return "pdf"
-    if raw[:2] == b"PK":
+    if raw[:2] == b"PK" and ext not in (".md", ".markdown", ".txt"):
         return "docx"
-    return "text"
+    base = "markdown" if ext in (".md", ".markdown") else "text"
+    # Content sniff: HTML hiding inside a .txt/.md (or extensionless) file.
+    if _looks_like_html(raw[:4096].decode("utf-8", "replace")):
+        return "html"
+    return base
+class _HTMLTextExtractor(html.parser.HTMLParser):
+    """Stdlib HTML -> text: drops script/style, frames block elements with blank
+    lines (so clause-heading detection still works), and unescapes entities."""
+    _SKIP = {"script", "style", "head", "title", "meta", "link", "noscript"}
+    _BLOCK = {
+        "p", "div", "br", "tr", "li", "h1", "h2", "h3", "h4", "h5", "h6",
+        "section", "article", "table", "ul", "ol", "blockquote", "pre", "hr",
+        "thead", "tbody", "header", "footer", "main",
+    }
+    def __init__(self) -> None:
+        super().__init__(convert_charrefs=True)
+        self._parts: List[str] = []
+        self._skip = 0
+    def handle_starttag(self, tag: str, attrs: Any) -> None:
+        if tag in self._SKIP:
+            self._skip += 1
+        elif tag in self._BLOCK:
+            self._parts.append("\n")
+    def handle_endtag(self, tag: str) -> None:
+        if tag in self._SKIP and self._skip > 0:
+            self._skip -= 1
+        elif tag in self._BLOCK:
+            self._parts.append("\n")
+    def handle_data(self, data: str) -> None:
+        if self._skip == 0:
+            self._parts.append(data)
+    def get_text(self) -> str:
+        # Strip each line; collapse runs of blank lines to a single blank line
+        # (gives ALL-CAPS / numbered headings their blank-line frame).
+        lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in "".join(self._parts).split("\n")]
+        out: List[str] = []
+        blank = False
+        for ln in lines:
+            if ln:
+                out.append(ln)
+                blank = False
+            elif not blank:
+                out.append("")
+                blank = True
+        return "\n".join(out).strip()
+def _read_html(raw_text: str) -> str:
+    parser = _HTMLTextExtractor()
+    try:
+        parser.feed(raw_text)
+        parser.close()
+    except Exception:
+        # Never crash on malformed markup; fall back to a crude tag strip.
+        return re.sub(r"<[^>]+>", " ", raw_text)
+    return parser.get_text()
 def _read_docx(path: Path, raw: bytes, prefer_optional: bool = True) -> Tuple[str, List[str]]:
@@ -834,9 +1006,15 @@ def _pdf_unescape(s: str) -> str:
 def _pdf_text_from_content(content: bytes) -> str:
+    """Pull text strings from a PDF content stream, but ONLY from inside text
+    objects (`BT` ... `ET`). Real text lives there; embedded fonts, images,
+    digital-signature blobs and metadata streams have no BT/ET, so gating on it
+    keeps their binary bytes (which often contain stray `(...)` sequences) out
+    of the output -- essential for real signed/font-embedded PDFs."""
     s = content.decode("latin-1", "replace")
     lines: List[str] = []
     cur: List[str] = []
+    in_text = False
     def flush() -> None:
         if cur:
@@ -845,17 +1023,34 @@ def _pdf_text_from_content(content: bytes) -> str:
     for m in _PDF_TOKEN_RE.finditer(s):
         tok = m.group(0)
-        if tok.startswith("("):
+        if tok == "BT":
+            flush()
+            in_text = True
+        elif tok == "ET":
+            flush()
+            in_text = False
+        elif not in_text:
+            continue
+        elif tok.startswith("("):
             cur.append(_pdf_unescape(tok[1:-1]))
         elif tok.startswith("["):
             for sm in re.finditer(r"\((?:\\.|[^\\()])*\)", tok):
                 cur.append(_pdf_unescape(sm.group(0)[1:-1]))
-        elif tok in ("Td", "TD", "T*", "'", '"', "BT", "ET"):
+        elif tok in ("Td", "TD", "T*", "'", '"'):
             flush()
     flush()
     return "\n".join(lines)
+def _mostly_printable(s: str) -> bool:
+    """True if `s` is overwhelmingly printable text (backstop against a
+    malformed stream slipping binary through the BT/ET gate)."""
+    if not s:
+        return False
+    printable = sum(1 for ch in s if ch in "\n\t" or 32 <= ord(ch) < 127 or ord(ch) > 160)
+    return printable / len(s) >= 0.85
 def _read_pdf_stdlib(raw: bytes) -> str:
     import zlib
@@ -873,9 +1068,11 @@ def _read_pdf_stdlib(raw: bytes) -> str:
             content = zlib.decompress(body)
         except Exception:
             content = body
-        chunks.append(_pdf_text_from_content(content))
+        piece = _pdf_text_from_content(content)
+        if piece.strip() and _mostly_printable(piece):
+            chunks.append(piece)
         idx = e + len(b"endstream")
-    return "\n".join(c for c in chunks if c.strip())
+    return "\n".join(chunks)
 def load_source(path: Path, prefer_optional: bool = True) -> Tuple[bytes, str, str, List[str]]:
@@ -894,6 +1091,8 @@ def load_source(path: Path, prefer_optional: bool = True) -> Tuple[bytes, str, s
     warnings: List[str] = []
     if fmt in ("markdown", "text"):
         text = raw.decode("utf-8", "replace")
+    elif fmt == "html":
+        text = _read_html(raw.decode("utf-8", "replace"))
     elif fmt == "docx":
         text, w = _read_docx(path, raw, prefer_optional)
         warnings += w
@@ -919,6 +1118,13 @@ def build_extraction(text: str, raw: bytes, fmt: str,
                      source_path: Optional[str]) -> JSON:
     """Run the deterministic tier and assemble the output contract object."""
     sha = hashlib.sha256(raw).hexdigest()
+    # Field extractors (parties, dates, governing law, term, value, defined
+    # terms) run on a whitespace-flattened copy so values that wrap across a
+    # line break in the source -- "...laws of the Province\nof Ontario", a party
+    # name split mid-line -- are matched whole. Clause detection and the title
+    # keep the original text, which depends on line structure.
+    flat = re.sub(r"[ \t\r\f\v]*\n[ \t\r\f\v]*", " ", text)
+    flat = re.sub(r"[ \t]+", " ", flat)
     return {
         "document": {
             "title": extract_title(text, Path(source_path) if source_path else None, fmt),
@@ -926,13 +1132,13 @@ def build_extraction(text: str, raw: bytes, fmt: str,
             "sha256": sha,
             "source_path": source_path,
         },
-        "parties": extract_parties(text),
-        "dates": extract_dates(text),
-        "term": extract_term(text),
-        "governing_law": extract_governing_law(text),
+        "parties": extract_parties(flat),
+        "dates": extract_dates(flat),
+        "term": extract_term(flat),
+        "governing_law": extract_governing_law(flat),
         "clauses": extract_clauses(text),
-        "defined_terms": extract_defined_terms(text),
-        "value": extract_value(text),
+        "defined_terms": extract_defined_terms(flat),
+        "value": extract_value(flat),
         "_meta": {
             "extractor_version": EXTRACTOR_VERSION,
             "tiers_used": ["deterministic"],
@@ -1244,7 +1450,7 @@ def output_schema() -> JSON:
                 "required": ["title", "format", "sha256", "source_path"],
                 "properties": {
                     "title": {"type": ["string", "null"]},
-                    "format": {"enum": ["markdown", "text", "docx", "pdf"]},
+                    "format": {"enum": ["markdown", "text", "docx", "pdf", "html"]},
                     "sha256": {"type": "string", "pattern": "^[0-9a-f]{64}$"},
                     "source_path": {"type": ["string", "null"]},
                 },
@@ -1293,7 +1499,7 @@ def output_schema() -> JSON:
                     "properties": {
                         "canonical_title": {"type": ["string", "null"]},
                         "detected_title": {"type": "string"},
-                        "tier": {"enum": ["h2", "bold-numbered", "all-caps", "explicit", "llm"]},
+                        "tier": {"enum": ["h2", "bold-numbered", "numbered", "all-caps", "explicit", "llm"]},
                         "span": {
                             "type": "object",
                             "required": ["start", "end"],
@@ -1595,7 +1801,7 @@ def _add_common_output_flags(p: argparse.ArgumentParser) -> None:
 def build_parser() -> argparse.ArgumentParser:
     parser = argparse.ArgumentParser(
         prog="extract",
-        description="Ingest any contract (.md/.txt/.docx/.pdf) and emit structured "
+        description="Ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured "
                     "JSON for the contract-ops CLI suite. See docs/INTEROP.md.",
     )
     parser.add_argument("-V", "--version", action="version",
@@ -1629,7 +1835,7 @@ def build_parser() -> argparse.ArgumentParser:
 def _build_extract_args(p: argparse.ArgumentParser) -> None:
-    p.add_argument("path", help="Path to the document (.md/.txt/.docx/.pdf).")
+    p.add_argument("path", help="Path to the document (.md/.txt/.html/.docx/.pdf).")
     p.add_argument("--llm", action="store_true",
                    help="Opt-in LLM enrichment of fuzzy fields (renewal, obligations). "
                         "Off by default; the deterministic core is fully useful without it.")

{extract_cli-0.1.0 → extract_cli-0.1.2}/pyproject.toml RENAMED

@@ -4,8 +4,8 @@ build-backend = "hatchling.build"
 [project]
 name = "extract-cli"
-version = "0.1.0"
-description = "Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.docx/.pdf) and emit structured JSON."
+version = "0.1.2"
+description = "Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON."
 readme = "README.md"
 requires-python = ">=3.9"
 license = { text = "MIT" }

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/_make_goldens.py RENAMED

@@ -20,7 +20,8 @@ from tests._fixtures_build import ensure_binary_fixtures  # noqa: E402
 FIXTURES = Path(__file__).resolve().parent / "fixtures"
 DOCS = ["nda_h2.md", "services_bold.txt", "lease_allcaps.txt",
-        "employment_docx.docx", "license_pdf.pdf", "scanned.pdf"]
+        "employment_docx.docx", "license_pdf.pdf", "services_html.html",
+        "scanned.pdf"]
 def golden_for(name: str) -> dict:

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/conftest.py RENAMED

@@ -26,6 +26,7 @@ CORPUS: Tuple[Tuple[str, str, str], ...] = (
     ("lease_allcaps.txt", "all-caps", "text"),
     ("employment_docx.docx", "bold-numbered", "docx"),
     ("license_pdf.pdf", "all-caps", "pdf"),
+    ("services_html.html", "numbered", "html"),
 )

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/fixtures/employment_docx.docx.expected.json RENAMED

@@ -138,7 +138,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.0",
+    "extractor_version": "0.1.2",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/fixtures/lease_allcaps.txt.expected.json RENAMED

@@ -133,7 +133,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.0",
+    "extractor_version": "0.1.2",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/fixtures/license_pdf.pdf.expected.json RENAMED

@@ -133,7 +133,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.0",
+    "extractor_version": "0.1.2",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/fixtures/nda_h2.md.expected.json RENAMED

@@ -121,6 +121,11 @@
       "confidence": 0.6,
       "source": "deterministic"
     },
+    {
+      "term": "Disclosing Party",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
     {
       "term": "Receiving Party",
       "confidence": 0.6,
@@ -138,7 +143,7 @@
     "source": "none"
   },
   "_meta": {
-    "extractor_version": "0.1.0",
+    "extractor_version": "0.1.2",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/fixtures/scanned.pdf.expected.json RENAMED

@@ -48,7 +48,7 @@
     "source": "none"
   },
   "_meta": {
-    "extractor_version": "0.1.0",
+    "extractor_version": "0.1.2",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/fixtures/services_bold.txt.expected.json RENAMED

@@ -133,7 +133,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.0",
+    "extractor_version": "0.1.2",
     "tiers_used": [
       "deterministic"
     ],

extract_cli-0.1.2/tests/fixtures/services_html.html ADDED

@@ -0,0 +1,35 @@
+<!DOCTYPE html>
+<html>
+<head>
+  <title>Exhibit 10.1</title>
+  <style>body { font-family: serif; } .hidden { display:none; }</style>
+  <script>var x = "(this should never appear in output)";</script>
+</head>
+<body>
+  <p align="center"><b>MASTER SERVICES AGREEMENT</b></p>
+  <p>This Master Services Agreement (the &ldquo;Agreement&rdquo;) is entered
+  into as of March 15, 2023 (the &quot;Effective Date&quot;), by and between
+  Initrode&nbsp;Systems,&nbsp;Inc., a Delaware corporation (&ldquo;Provider&rdquo;),
+  and Hooli&nbsp;LLC (&ldquo;Customer&rdquo;).</p>
+  <p>1. Services</p>
+  <p>Provider shall perform the services described in each Statement of Work.</p>
+  <p>2. Fees and Payment</p>
+  <p>Customer shall pay Provider the fees set forth in the applicable Statement
+  of Work, not to exceed $500,000 in the aggregate.</p>
+  <p>3. Term and Termination</p>
+  <p>The initial term of this Agreement is two (2) years. Either party may
+  terminate upon sixty (60) days&rsquo; written notice. This Agreement shall
+  automatically renew for successive one-year terms.</p>
+  <p>4. Confidentiality</p>
+  <p>Each party shall protect the other&rsquo;s &ldquo;Confidential
+  Information&rdquo; using reasonable care.</p>
+  <p>5. Governing Law</p>
+  <p>This Agreement shall be governed by the laws of the State of California.</p>
+</body>
+</html>
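
The fixture above exercises three behaviours an HTML reader needs: `<script>`/`<style>` content is dropped, entities such as `&ldquo;` and `&nbsp;` are unescaped, and block tags separate paragraphs. The following is a minimal sketch of such a reader built on the stdlib `html.parser`. It is not extract-cli's own reader: `TextOnlyParser`, `html_to_text`, and the tag sets are hypothetical names and choices made for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; the real reader may use different ones.
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextOnlyParser(HTMLParser):
    """Hypothetical helper: collects visible text, skipping script/style."""

    def __init__(self):
        # convert_charrefs=True delivers entities to handle_data already unescaped.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return visible text, one non-empty line per block element."""
    parser = TextOnlyParser()
    parser.feed(markup)
    parser.close()  # flushes any text still buffered after the last tag
    text = "".join(parser.parts).replace("\xa0", " ")  # nbsp -> plain space
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

`html.parser` is tolerant of unclosed tags, so a fragment such as `<p>unclosed <b>bold <div>text` still yields text rather than raising.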

extract_cli-0.1.2/tests/fixtures/services_html.html.expected.json ADDED

@@ -0,0 +1,157 @@
+{
+  "document": {
+    "title": "MASTER SERVICES AGREEMENT",
+    "format": "html",
+    "sha256": "088b40f13135e6b5d8f8548b162d657f10725d348388c7c3a416d11d7fc65300",
+    "source_path": "services_html.html"
+  },
+  "parties": [
+    {
+      "name": "Initrode Systems, Inc.",
+      "confidence": 0.9,
+      "source": "deterministic",
+      "role": "Provider"
+    },
+    {
+      "name": "Hooli LLC",
+      "confidence": 0.9,
+      "source": "deterministic",
+      "role": "Customer"
+    }
+  ],
+  "dates": {
+    "effective": {
+      "value": "2023-03-15",
+      "confidence": 0.9,
+      "source": "deterministic"
+    },
+    "expiration": {
+      "value": null,
+      "confidence": 0.0,
+      "source": "none"
+    }
+  },
+  "term": {
+    "length": {
+      "value": "2 years",
+      "confidence": 0.7,
+      "source": "deterministic"
+    },
+    "auto_renew": {
+      "value": true,
+      "confidence": 0.65,
+      "source": "deterministic"
+    },
+    "notice_period_days": {
+      "value": 60,
+      "confidence": 0.7,
+      "source": "deterministic"
+    }
+  },
+  "governing_law": {
+    "value": "State of California",
+    "confidence": 0.85,
+    "source": "deterministic"
+  },
+  "clauses": [
+    {
+      "canonical_title": "Services",
+      "detected_title": "1. Services",
+      "tier": "numbered",
+      "span": {
+        "start": 242,
+        "end": 329
+      },
+      "confidence": 0.6,
+      "source": "deterministic",
+      "mapped": false
+    },
+    {
+      "canonical_title": "Payment",
+      "detected_title": "2. Fees and Payment",
+      "tier": "numbered",
+      "span": {
+        "start": 329,
+        "end": 476
+      },
+      "confidence": 0.8,
+      "source": "deterministic",
+      "mapped": true
+    },
+    {
+      "canonical_title": "Termination",
+      "detected_title": "3. Term and Termination",
+      "tier": "numbered",
+      "span": {
+        "start": 476,
+        "end": 692
+      },
+      "confidence": 0.8,
+      "source": "deterministic",
+      "mapped": true
+    },
+    {
+      "canonical_title": "Confidentiality",
+      "detected_title": "4. Confidentiality",
+      "tier": "numbered",
+      "span": {
+        "start": 692,
+        "end": 800
+      },
+      "confidence": 0.8,
+      "source": "deterministic",
+      "mapped": true
+    },
+    {
+      "canonical_title": "Governing Law",
+      "detected_title": "5. Governing Law",
+      "tier": "numbered",
+      "span": {
+        "start": 800,
+        "end": 890
+      },
+      "confidence": 0.8,
+      "source": "deterministic",
+      "mapped": true
+    }
+  ],
+  "defined_terms": [
+    {
+      "term": "Agreement",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
+    {
+      "term": "Effective Date",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
+    {
+      "term": "Provider",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
+    {
+      "term": "Customer",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
+    {
+      "term": "Confidential Information",
+      "confidence": 0.6,
+      "source": "deterministic"
+    }
+  ],
+  "value": {
+    "value": "$500,000",
+    "confidence": 0.6,
+    "source": "deterministic"
+  },
+  "_meta": {
+    "extractor_version": "0.1.2",
+    "tiers_used": [
+      "deterministic"
+    ],
+    "llm_used": false
+  }
+}

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/test_clause_map.py RENAMED

@@ -25,6 +25,50 @@ def test_tier3_all_caps() -> None:
     assert [c["tier"] for c in clauses] == ["all-caps", "all-caps"]
+def test_tier_numbered_plain_headings() -> None:
+    # Real-world dominant format: plain numbered, mixed-case, unbolded headings.
+    text = ("1. Term And Nature Of Employment\n\nbody about term\n\n"
+            "2. Wage Compensation\n\nbody about wages\n\n"
+            "5. Termination\n\nbody about termination")
+    clauses = ex.detect_clauses(text)
+    assert [c["tier"] for c in clauses] == ["numbered", "numbered", "numbered"]
+    assert clauses[0]["title"] == "Term And Nature Of Employment"
+    assert clauses[2]["title"] == "Termination"
+def test_numbered_heading_rejects_sentences() -> None:
+    # "1. The Company shall pay..." is a numbered sentence, not a heading.
+    assert ex._qualifies_as_numbered_heading("Wage Compensation")
+    assert ex._qualifies_as_numbered_heading("Term And Nature Of Employment")
+    assert ex._qualifies_as_numbered_heading("Termination")
+    assert not ex._qualifies_as_numbered_heading("The Company shall pay the Employee monthly")
+    assert not ex._qualifies_as_numbered_heading("Fee")  # single word < 4 letters
+    assert not ex._qualifies_as_numbered_heading(
+        "EMPLOYEE shall be compensated on the basis of an annual salary")
+def test_numbered_section_article_prefixes() -> None:
+    text = ("Section 1. Definitions\n\nx\n\nSection 2. Confidentiality\n\ny\n\n"
+            "Article IV. Governing Law\n\nz")
+    clauses = ex.detect_clauses(text)
+    assert all(c["tier"] == "numbered" for c in clauses)
+    assert clauses[0]["title"] == "Definitions"
+    assert clauses[2]["title"] == "Governing Law"
+def test_numbered_does_not_shadow_bold() -> None:
+    # Bold-numbered must win over plain-numbered when both could match.
+    text = "**1. Purpose**\n\nx\n\n**2. Scope**\n\ny"
+    assert all(c["tier"] == "bold-numbered" for c in ex.detect_clauses(text))
+def test_trailing_period_stripped_from_titles() -> None:
+    canon, mapped = ex._canonicalize_clause("Other Benefits.")
+    assert canon == "Other Benefits"
+    # And a mapped clause with a trailing period still maps.
+    assert ex._canonicalize_clause("Survival.") == ("Survival", True)
 def test_cascade_priority_h2_wins() -> None:
     # An H2 present means the bold/all-caps fallbacks must not fire.
     text = "## Real Heading\n\n**1. Not A Heading**\n\nALSO NOT A HEADING\n\nbody"
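
The new tests above pin down what counts as a plain numbered heading: short, title-cased phrases qualify, while numbered sentences ("1. The Company shall pay...") and very short single words ("Fee") do not. One way such a filter could be written is sketched below. The cue words, the word limit, and the names `looks_like_heading` and `numbered_headings` are assumptions made for illustration, not the logic of `ex._qualifies_as_numbered_heading`.

```python
import re

# Assumed cue words: a modal or linking verb suggests a sentence, not a title.
SENTENCE_CUES = {"shall", "will", "must", "may", "is", "are", "agrees"}
# Connector words that may stay lowercase in a title-cased heading.
SMALL_WORDS = {"and", "of", "the", "or", "to", "in", "for", "a", "an"}
# Optional "Section"/"Article" prefix, an arabic or roman number, a period,
# then the title; a trailing period on the title is left out of the capture.
NUMBERED_LINE = re.compile(
    r"^(?:(?:Section|Article)\s+)?(?:\d+|[IVXLC]+)\.\s+(.+?)\.?$")


def looks_like_heading(title, max_words=8):
    """Hypothetical heuristic: True for short title-cased phrases."""
    words = title.split()
    if not words or len(words) > max_words:
        return False
    if len(words) == 1 and len(words[0]) < 4:
        return False
    if any(w.lower() in SENTENCE_CUES for w in words):
        return False
    # Every word other than a small connector must start with a capital.
    return all(w[0].isupper() for w in words if w.lower() not in SMALL_WORDS)


def numbered_headings(text):
    """Return the titles of lines that look like numbered headings."""
    titles = []
    for line in text.splitlines():
        m = NUMBERED_LINE.match(line.strip())
        if m and looks_like_heading(m.group(1)):
            titles.append(m.group(1))
    return titles
```

Because the pattern is anchored at the start of the line, a bold heading such as `**1. Purpose**` does not match, which leaves it to a separate bold-numbered rule.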

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/test_cli.py RENAMED

@@ -22,7 +22,7 @@ def test_version(capsys: pytest.CaptureFixture[str]) -> None:
     with pytest.raises(SystemExit) as exc:
         ex.main(["--version"])
     assert exc.value.code == 0
-    assert "extract-cli 0.1.0" in capsys.readouterr().out
+    assert f"extract-cli {ex.__version__}" in capsys.readouterr().out
 def test_demo_runs(capsys: pytest.CaptureFixture[str]) -> None:

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/test_deterministic.py RENAMED

@@ -12,8 +12,8 @@ def test_parties_between_simple() -> None:
     assert all(0.0 <= p["confidence"] <= 1.0 for p in parties)
-def test_parties_with_roles_and_linebreak() -> None:
-    text = ('by and between Acme Corp. (the "Disclosing\nParty") and '
+def test_parties_with_roles() -> None:
+    text = ('by and between Acme Corp. (the "Disclosing Party") and '
             'Beta LLC (the "Receiving Party"), dated March 1, 2024.')
     parties = ex.extract_parties(text)
     assert parties[0]["name"] == "Acme Corp."
@@ -22,6 +22,30 @@ def test_parties_with_roles_and_linebreak() -> None:
     assert parties[1]["role"] == "Receiving Party"
+def test_parties_linebreak_handled_by_build() -> None:
+    # build_extraction flattens whitespace, so a party/role that wraps across a
+    # line is matched whole.
+    text = ('This Agreement is made by and between Acme Corp. (the "Disclosing\n'
+            'Party") and Beta LLC (the "Receiving Party").')
+    r = ex.build_extraction(text, text.encode("utf-8"), "text", "x.txt")
+    assert [p["name"] for p in r["parties"]] == ["Acme Corp.", "Beta LLC"]
+    assert r["parties"][0]["role"] == "Disclosing Party"
+def test_parties_skip_and_inside_description() -> None:
+    # An "and" inside a party's own description must not split the parties.
+    text = ("between Blade Ventures Inc., a Nevada corporation having offices at "
+            "1 Main St and doing business as Foo (\"Client\"), and KPMG LP")
+    parties = ex.extract_parties(text)
+    assert [p["name"] for p in parties] == ["Blade Ventures Inc.", "KPMG LP"]
+def test_party_name_descriptors_trimmed() -> None:
+    assert ex._clean_party_name("Visteon Corporation, a Delaware corporation") == "Visteon Corporation"
+    assert ex._clean_party_name("Foo Inc. doing business as Bar") == "Foo Inc."
+    assert ex._clean_party_name("Baz LLC having its principal office at X") == "Baz LLC"
 def test_parties_none() -> None:
     assert ex.extract_parties("There are no parties named here.") == []
@@ -39,6 +63,25 @@ def test_dates_iso_normalization() -> None:
         assert out["source"] == "deterministic"
+def test_dates_effective_date_label_and_as_of() -> None:
+    # The "(the "Effective Date")" anchor, with the date wrapping a newline.
+    text = 'between A and B as of August\n31, 2016 (the "Effective Date").'
+    assert ex.extract_dates(text)["effective"]["value"] == "2016-08-31"
+    # Bare "as of <date>" cue.
+    assert ex.extract_dates("dated as of June 1, 2023")["effective"]["value"] == "2023-06-01"
+def test_term_length_rejects_non_number() -> None:
+    # "...for consecutive days" must NOT be reported as a term length.
+    text = "the Employment Period shall run for consecutive days as scheduled"
+    assert ex.extract_term(text)["length"]["source"] == "none"
+def test_title_skips_sgml_wrapper() -> None:
+    text = "<DOCUMENT>\n<TYPE>EX-10\n<TEXT>\n\nEMPLOYMENT AGREEMENT\n\nbody"
+    assert ex.extract_title(text, None, "text") == "EMPLOYMENT AGREEMENT"
 def test_dates_missing() -> None:
     out = ex.extract_dates("no dates in here")
     assert out["effective"] == ex._none_field()
@@ -61,6 +104,15 @@ def test_governing_law_stops_before_trailing_clause() -> None:
     assert out["value"] == "State of Delaware"
+def test_governing_law_linebreak_handled_by_build() -> None:
+    # A jurisdiction that wraps a line ("...the Province\nof Ontario") is
+    # matched whole because build_extraction flattens whitespace first.
+    text = ("This Agreement shall be governed by the laws of the Province\n"
+            "of Ontario and the federal laws of Canada.")
+    r = ex.build_extraction(text, text.encode("utf-8"), "text", "x.txt")
+    assert r["governing_law"]["value"] == "Province of Ontario"
 def test_governing_law_missing() -> None:
     assert ex.extract_governing_law("nothing about law")["source"] == "none"
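
Several of the tests above rely on `build_extraction` flattening whitespace before the field extractors run, so that a value wrapping across a line is matched whole. The effect can be illustrated with a small sketch. The pattern and the names `flatten_whitespace` and `governing_law` are hypothetical and far simpler than whatever the package uses; the point is only that a single-line pattern fails on the raw text and succeeds on the flattened copy.

```python
import re


def flatten_whitespace(text):
    """Collapse every run of whitespace, including newlines, to one space."""
    return re.sub(r"\s+", " ", text).strip()


# Assumed pattern for illustration: capture the jurisdiction up to the first
# " and ", comma, period, or end of string. "." does not match a newline, so
# this pattern cannot span a line break on its own.
GOVERNING_LAW = re.compile(
    r"governed by the laws of (?:the )?(.+?)(?= and |[,.]|$)", re.IGNORECASE)


def governing_law(text, flatten=True):
    """Return the captured jurisdiction, or None when nothing matches."""
    haystack = flatten_whitespace(text) if flatten else text
    m = GOVERNING_LAW.search(haystack)
    return m.group(1) if m else None


wrapped = ("This Agreement shall be governed by the laws of the Province\n"
           "of Ontario and the federal laws of Canada.")
print(governing_law(wrapped))                 # matched on the flattened copy
print(governing_law(wrapped, flatten=False))  # the line break defeats the pattern
```

Flattening changes character offsets, which is consistent with the ARCHITECTURE.md note that clause detection keeps the original text while only the field extractors see the flattened copy.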

{extract_cli-0.1.0 → extract_cli-0.1.2}/tests/test_misc.py RENAMED

@@ -142,6 +142,45 @@ def test_pdf_unescape() -> None:
     assert ex._pdf_unescape(r"\101\102") == "AB"  # octal escapes
+def test_html_extraction() -> None:
+    raw, text, fmt, _w = ex.load_source(FIXTURES / "services_html.html")
+    assert fmt == "html"
+    # script/style content is dropped; entities are unescaped.
+    assert "this should never appear" not in text
+    result = ex.build_extraction(text, raw, fmt, "services_html.html")
+    assert result["document"]["format"] == "html"
+    assert [p["name"] for p in result["parties"]] == ["Initrode Systems, Inc.", "Hooli LLC"]
+    assert result["governing_law"]["value"] == "State of California"
+    assert result["dates"]["effective"]["value"] == "2023-03-15"
+    canon = {c["canonical_title"] for c in result["clauses"]}
+    assert {"Payment", "Termination", "Confidentiality", "Governing Law"} <= canon
+def test_html_detected_by_content_sniff(tmp_path: Any) -> None:
+    # HTML masquerading as .txt (e.g. a SEC EDGAR full submission) is sniffed.
+    p = tmp_path / "exhibit.txt"
+    p.write_text("<html><body><p>between A Co and B Co</p></body></html>")
+    _raw, _text, fmt, _w = ex.load_source(p)
+    assert fmt == "html"
+def test_html_malformed_does_not_crash() -> None:
+    assert ex._read_html("<p>unclosed <b>bold <div>text") is not None
+def test_pdf_text_only_inside_bt_et() -> None:
+    # Strings outside BT/ET (font/signature/metadata stream bytes that happen to
+    # contain parentheses) must be ignored; only text objects yield text.
+    content = b"(garbage outside) /Font << >> BT (real text) Tj ET (more garbage)"
+    assert ex._pdf_text_from_content(content) == "real text"
+def test_pdf_mostly_printable_backstop() -> None:
+    assert ex._mostly_printable("Hello, world")
+    assert not ex._mostly_printable("\x00\x01\x02\x03\x04\x05\x06\x07")
+    assert not ex._mostly_printable("")
 def test_extract_json_object_from_noise() -> None:
     assert ex._extract_json_object('prefix {"a": 1} suffix') == {"a": 1}
     assert ex._extract_json_object("no json here") is None
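
The two PDF tests above describe a pair of safeguards for a stdlib-only reader: take strings only from inside `BT ... ET` text objects, and reject output that is mostly non-printable. A rough sketch of both ideas follows. It is not the package's `_pdf_text_from_content` or `_mostly_printable`: the function names, the regular expressions, and the 0.85 threshold are assumptions, and the sketch handles only the `Tj` operator (no `TJ` arrays, no nested parentheses, no escape decoding).

```python
import re

# A text object is everything between the BT and ET operators.
TEXT_OBJECT = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A literal string immediately followed by the Tj (show text) operator.
# Backslash escapes are skipped over but not decoded in this sketch.
SHOWN_STRING = re.compile(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj")


def text_inside_bt_et(content):
    """Return the Tj strings found inside BT/ET blocks, joined by spaces."""
    pieces = []
    for block in TEXT_OBJECT.finditer(content):
        for shown in SHOWN_STRING.finditer(block.group(1)):
            pieces.append(shown.group(1).decode("latin-1"))
    return " ".join(pieces)


def mostly_printable(text, threshold=0.85):
    """Backstop: False for empty text or text dominated by control bytes."""
    if not text:
        return False
    printable = sum(ch.isprintable() or ch in "\n\t" for ch in text)
    return printable / len(text) >= threshold
```

Parenthesised bytes outside a text object, such as font or metadata fragments, never reach `SHOWN_STRING`, so they cannot leak into the extracted text.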