PyPI - extract-cli - Versions diffs - 0.1.12__tar.gz → 0.1.14__tar.gz - Mend

Files changed (57)

{extract_cli-0.1.12 → extract_cli-0.1.14}/CHANGELOG.md RENAMED

@@ -6,6 +6,40 @@ to [Semantic Versioning](https://semver.org/). Per the suite convention
 (see [`docs/INTEROP.md`](docs/INTEROP.md)), **backward-incompatible changes to
 the output schema require a major version bump**; new optional fields are minor.
+## [0.1.14] - 2026-05-22
+### Improved
+- **HTML clause detection now recognizes emphasis-marked headings.** Real HTML
+  contracts (e.g. SEC EDGAR exhibits) mark section headings with emphasis, not
+  `##`/numbers — a heading tag, `<b>`/`<strong>`/`<u>`, or **CSS**
+  (`font-weight:bold` / `text-decoration:underline`), often with a leading
+  `(g)` / `1.` token and the body run-in. The HTML reader now emits such blocks
+  as `## ` headings (splitting run-in title from body; a lone emphasized block
+  is treated as a title and left plain so numbered/ALL-CAPS sections still win).
+  On the accuracy benchmark this lifts **clause recall 0.45 → 0.86** (F1 0.62 →
+  0.93), precision still 1.00. Residual misses are compound/combined headings.
+## [0.1.13] - 2026-05-22
+### Added
+- **Accuracy benchmark** (`tests/eval/`, `make eval`). Scores the deterministic
+  tier against a small corpus of real, executed SEC-EDGAR contracts with
+  hand-verified ground truth, reporting precision/recall/F1 per field — turning
+  "best-effort" into a measured number. Current: parties F1 0.96, effective
+  date / governing law / jurisdiction 1.00, clause recall 0.45 (heading
+  detection on dense HTML is the known weak spot). `tests/test_eval.py` gates it
+  so accuracy can't silently regress.
+### Fixed / improved (surfaced by the benchmark)
+- **Governing-law detection** now covers the common connector phrasings beyond
+  "governed by the laws of X": "governed by, **and enforced in accordance
+  with,** the laws of X", "**interpreted and enforced in accordance with** the
+  laws of X", "**construed under** the laws of X". (Benchmark: governing law
+  0.67 → 1.00.)
+- **Jurisdiction normalization** now maps **all 50 US states + DC** (plus more
+  Canadian provinces / UK nations / countries), not just a dozen. (Benchmark:
+  jurisdiction 0.67 → 1.00.)
 ## [0.1.12] - 2026-05-22
 ### Security
@@ -333,6 +367,8 @@ Initial release — the open-loop front door of the contract-ops CLI suite.
   intentionally *not* governed by the output schema (the schema describes the
   full default output).
+[0.1.14]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.14
+[0.1.13]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.13
 [0.1.12]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.12
 [0.1.11]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.11
 [0.1.10]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.10
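Reviewer note: the changelog entries above quote precision, recall, and F1 per field. As a reminder of how those three numbers relate, here is a minimal sketch that scores a predicted set of labels against a gold set. The function name `prf` and the sample headings are hypothetical and are not taken from `tests/eval/`; the actual scorer may count matches differently.

```python
from typing import Set, Tuple


def prf(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a predicted set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical data: seven verified headings, six of them detected, no false hits.
gold_headings = {"Term", "Duties", "Compensation", "Termination",
                 "Confidentiality", "Governing Law", "Notices"}
found_headings = gold_headings - {"Notices"}

p, r, f = prf(found_headings, gold_headings)
print(f"P {p:.2f}  R {r:.2f}  F1 {f:.2f}")
```

With perfect precision, F1 is driven entirely by recall, which is why a recall gain alone moves the clause F1 so much in the 0.1.14 entry.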

{extract_cli-0.1.12 → extract_cli-0.1.14}/Makefile RENAMED

@@ -40,6 +40,9 @@ coverage:
 typecheck:
 	$(PYTHON) -m mypy --strict extract_cli.py
+eval:
+	$(PYTHON) tests/eval/evaluate.py
 build: clean
 	$(PYTHON) -m build
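Reviewer note: the new `eval` target runs the benchmark, and the changelog says `tests/test_eval.py` gates the scores. The contents of that test are not part of this diff, so the following is only a sketch of what such a gate could look like. The floor values, the score keys, and the helper name `regressions` are all assumptions chosen for illustration.

```python
from typing import Dict, List

# Hypothetical floors; the real thresholds in tests/test_eval.py are not shown in this diff.
FLOORS: Dict[str, float] = {"parties_f1": 0.95, "clause_recall": 0.85}


def regressions(scores: Dict[str, float], floors: Dict[str, float]) -> List[str]:
    """List every metric whose score fell below its floor (missing metrics count as 0)."""
    failures = []
    for name, floor in floors.items():
        score = scores.get(name, 0.0)
        if score < floor:
            failures.append(f"{name}: {score:.2f} < {floor:.2f}")
    return failures


current = {"parties_f1": 0.96, "clause_recall": 0.86}
print(regressions(current, FLOORS))  # an empty list means the gate passes
```

A test would then simply assert that the returned list is empty, so a failing run names the metric that regressed.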

{extract_cli-0.1.12 → extract_cli-0.1.14}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: extract-cli
-Version: 0.1.12
+Version: 0.1.14
 Summary: Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON.
 Project-URL: Homepage, https://cli.drbaher.com/
 Project-URL: Repository, https://github.com/DrBaher/extract-cli
@@ -256,13 +256,34 @@ paths. Configure it once and every suite tool that adopts the same lookup gets
 LLM features for free. Without it, `--llm` just warns and returns the
 deterministic output.
+## Accuracy
+Line coverage tells you the code runs; it doesn't tell you the extraction is
+*correct*. `make eval` scores the deterministic tier against a small corpus of
+**real, executed contracts** (SEC EDGAR filings) with hand-verified ground truth
+([`tests/eval/`](tests/eval/)), reporting precision/recall per field:
+| Field | Score |
+|---|---|
+| parties | P 1.00 · R 0.92 · F1 0.96 |
+| effective date | accuracy 1.00 |
+| governing law | accuracy 1.00 |
+| jurisdiction (normalized) | accuracy 1.00 |
+| clauses (recall on verified sections) | 0.86 |
+Clause recall improved sharply once the HTML reader learned to treat
+emphasis (heading tags, <b>/<u>, CSS font-weight/underline) as section
+headings; the residual misses are compound/combined heading titles. A test (`tests/test_eval.py`) gates these so
+accuracy can't silently regress.
 ## Development
 ```bash
 make install      # editable install with the [dev] extra
 make test         # full suite
-make coverage     # suite + coverage report
+make coverage     # suite + coverage report (installs extras; fails under 100%)
 make typecheck    # mypy --strict
+make eval         # accuracy benchmark vs the labeled corpus
 make build        # wheel + sdist
 make smoke        # build, install the wheel in a clean venv, run it
 make spec-check   # assert docs/spec schema == `extract schema`

{extract_cli-0.1.12 → extract_cli-0.1.14}/README.md RENAMED

@@ -218,13 +218,34 @@ paths. Configure it once and every suite tool that adopts the same lookup gets
 LLM features for free. Without it, `--llm` just warns and returns the
 deterministic output.
+## Accuracy
+Line coverage tells you the code runs; it doesn't tell you the extraction is
+*correct*. `make eval` scores the deterministic tier against a small corpus of
+**real, executed contracts** (SEC EDGAR filings) with hand-verified ground truth
+([`tests/eval/`](tests/eval/)), reporting precision/recall per field:
+| Field | Score |
+|---|---|
+| parties | P 1.00 · R 0.92 · F1 0.96 |
+| effective date | accuracy 1.00 |
+| governing law | accuracy 1.00 |
+| jurisdiction (normalized) | accuracy 1.00 |
+| clauses (recall on verified sections) | 0.86 |
+Clause recall improved sharply once the HTML reader learned to treat
+emphasis (heading tags, <b>/<u>, CSS font-weight/underline) as section
+headings; the residual misses are compound/combined heading titles. A test (`tests/test_eval.py`) gates these so
+accuracy can't silently regress.
 ## Development
 ```bash
 make install      # editable install with the [dev] extra
 make test         # full suite
-make coverage     # suite + coverage report
+make coverage     # suite + coverage report (installs extras; fails under 100%)
 make typecheck    # mypy --strict
+make eval         # accuracy benchmark vs the labeled corpus
 make build        # wheel + sdist
 make smoke        # build, install the wheel in a clean venv, run it
 make spec-check   # assert docs/spec schema == `extract schema`

{extract_cli-0.1.12 → extract_cli-0.1.14}/extract_cli.py RENAMED

@@ -43,11 +43,11 @@ import urllib.request
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
-__version__ = "0.1.12"
+__version__ = "0.1.14"
 # Bumped independently of the package version when the *extraction logic*
 # changes in a way downstream consumers should notice. Embedded in `_meta`.
-EXTRACTOR_VERSION = "0.1.12"
+EXTRACTOR_VERSION = "0.1.14"
 # JSON Schema version of the output contract (docs/spec/extract-output.schema.json).
 SCHEMA_VERSION = 1
@@ -616,8 +616,11 @@ _ROLE_PAREN_RE = re.compile(
 # enforces a capitalized proper noun (a global re.IGNORECASE would defeat that
 # and over-capture trailing lowercase clauses like ", without regard to ...").
 _GOV_LAW_RE = re.compile(
-    r"(?i:governed\s+by(?:\s+and\s+construed\s+in\s+accordance\s+with)?\s+"
-    r"(?:the\s+)?laws?\s+of\s+(?:the\s+)?)"
+    # Allow a short same-sentence gap between "governed by" and "laws of" so the
+    # many real connector phrasings are covered: "...and construed in accordance
+    # with...", "...and enforced in accordance with...", "the internal laws of",
+    # etc. (bounded + lazy so it stays within the clause).
+    r"(?i:(?:governed|construed|interpreted|enforced)\b[^.\n]{0,60}?\blaws?\s+of\s+(?:the\s+)?)"
     r"([A-Z][A-Za-z\.\- ]+?(?:,\s*[A-Z][A-Za-z\.\- ]+?)?)"
     r"(?=[\.,;\n)]|\s+and\b|\s+without\b|$)",
 )
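Reviewer note: to see what the widened prefix in the hunk above buys, the sketch below compiles the old and new patterns exactly as they appear in the diff and runs both over the connector phrasings named in the changelog. The names `OLD_GOV_LAW_RE`, `NEW_GOV_LAW_RE`, `_TAIL`, and `capture` are local to this sketch, and the sample sentences are invented.

```python
import re
from typing import Optional, Pattern

# Capture group and lookahead shared by both versions (copied from the diff).
# The prefix is case-insensitive via a scoped (?i:...) group; this tail is not,
# so the captured name must start with a capital letter.
_TAIL = (
    r"([A-Z][A-Za-z\.\- ]+?(?:,\s*[A-Z][A-Za-z\.\- ]+?)?)"
    r"(?=[\.,;\n)]|\s+and\b|\s+without\b|$)"
)

OLD_GOV_LAW_RE = re.compile(
    r"(?i:governed\s+by(?:\s+and\s+construed\s+in\s+accordance\s+with)?\s+"
    r"(?:the\s+)?laws?\s+of\s+(?:the\s+)?)" + _TAIL
)
NEW_GOV_LAW_RE = re.compile(
    r"(?i:(?:governed|construed|interpreted|enforced)\b[^.\n]{0,60}?\blaws?\s+of\s+(?:the\s+)?)"
    + _TAIL
)


def capture(rx: Pattern[str], text: str) -> Optional[str]:
    """Return the captured jurisdiction phrase, or None when nothing matches."""
    m = rx.search(text)
    return m.group(1) if m else None


samples = [
    "This Agreement is governed by the laws of the State of Delaware.",
    "This Agreement shall be governed by, and enforced in accordance with, "
    "the laws of the State of Delaware, without regard to conflicts of law.",
    "This Agreement shall be interpreted and enforced in accordance with "
    "the laws of the State of New York.",
    "This Agreement shall be construed under the laws of Texas.",
]

for s in samples:
    print(capture(OLD_GOV_LAW_RE, s), "|", capture(NEW_GOV_LAW_RE, s))
```

Only the first sample matches the old pattern; the new one captures a jurisdiction phrase from all four. The trade-off is that any of the four verbs followed within 60 characters by "laws of" now triggers a match, so the bounded lazy gap is what keeps the match inside a single sentence.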
@@ -889,16 +892,31 @@ def extract_signatories(text: str) -> List[JSON]:
     return out
-# Free-text jurisdiction -> a normalized ISO-ish code (best-effort, common only).
+# Free-text jurisdiction -> a normalized ISO 3166-2 / ISO 3166-1 code. All 50 US
+# states + DC, common Canadian provinces, UK nations, and frequent countries.
+_US_STATES: Dict[str, str] = {
+    "alabama": "AL", "alaska": "AK", "arizona": "AZ", "arkansas": "AR",
+    "california": "CA", "colorado": "CO", "connecticut": "CT", "delaware": "DE",
+    "florida": "FL", "georgia": "GA", "hawaii": "HI", "idaho": "ID",
+    "illinois": "IL", "indiana": "IN", "iowa": "IA", "kansas": "KS",
+    "kentucky": "KY", "louisiana": "LA", "maine": "ME", "maryland": "MD",
+    "massachusetts": "MA", "michigan": "MI", "minnesota": "MN", "mississippi": "MS",
+    "missouri": "MO", "montana": "MT", "nebraska": "NE", "nevada": "NV",
+    "new hampshire": "NH", "new jersey": "NJ", "new mexico": "NM", "new york": "NY",
+    "north carolina": "NC", "north dakota": "ND", "ohio": "OH", "oklahoma": "OK",
+    "oregon": "OR", "pennsylvania": "PA", "rhode island": "RI", "south carolina": "SC",
+    "south dakota": "SD", "tennessee": "TN", "texas": "TX", "utah": "UT",
+    "vermont": "VT", "virginia": "VA", "washington": "WA", "west virginia": "WV",
+    "wisconsin": "WI", "wyoming": "WY", "district of columbia": "DC",
+}
 _JURISDICTION_CODES: Dict[str, str] = {
-    "delaware": "US-DE", "new york": "US-NY", "california": "US-CA",
-    "texas": "US-TX", "illinois": "US-IL", "massachusetts": "US-MA",
-    "washington": "US-WA", "florida": "US-FL", "nevada": "US-NV",
-    "new jersey": "US-NJ", "pennsylvania": "US-PA", "michigan": "US-MI",
+    **{name: f"US-{code}" for name, code in _US_STATES.items()},
     "ontario": "CA-ON", "quebec": "CA-QC", "british columbia": "CA-BC",
-    "england and wales": "GB-EAW", "england": "GB-ENG", "scotland": "GB-SCT",
+    "alberta": "CA-AB", "england and wales": "GB-EAW", "england": "GB-ENG",
+    "scotland": "GB-SCT", "wales": "GB-WLS", "northern ireland": "GB-NIR",
     "united kingdom": "GB", "france": "FR", "germany": "DE", "ireland": "IE",
     "singapore": "SG", "australia": "AU", "india": "IN", "netherlands": "NL",
+    "switzerland": "CH", "japan": "JP",
 }
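Reviewer note: the hunk above builds the US entries by prefixing each two-letter state code with `US-` and merging the result into the main table. The sketch below reproduces that construction on a three-state subset and adds a lookup helper. The helper `normalize_jurisdiction` and its prefix-stripping rules are assumptions for illustration; the lookup code that actually consumes `_JURISDICTION_CODES` is not shown in this diff.

```python
from typing import Dict, Optional

# Subset of the state table from the diff, enough to show the construction.
US_STATES: Dict[str, str] = {"delaware": "DE", "indiana": "IN", "new york": "NY"}

JURISDICTION_CODES: Dict[str, str] = {
    # Same dict-comprehension merge as in the diff: "delaware" -> "US-DE", etc.
    **{name: f"US-{code}" for name, code in US_STATES.items()},
    "germany": "DE", "india": "IN", "ontario": "CA-ON",
}


def normalize_jurisdiction(raw: str) -> Optional[str]:
    """Hypothetical lookup: lowercase, drop common lead-ins, then consult the table."""
    key = raw.strip().lower()
    for prefix in ("the ", "state of ", "commonwealth of ", "province of "):
        if key.startswith(prefix):
            key = key[len(prefix):]
    return JURISDICTION_CODES.get(key)


for phrase in ("the State of Delaware", "Germany", "Indiana", "India", "Atlantis"):
    print(phrase, "->", normalize_jurisdiction(phrase))
```

The `US-` prefix matters because bare state codes collide with country codes: Delaware and Germany are both `DE`, Indiana and India are both `IN`. After prefixing, they stay distinct as `US-DE` versus `DE` and `US-IN` versus `IN`.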
@@ -1051,11 +1069,24 @@ def _detect_format(path: Path, raw: bytes) -> str:
     return base
+def _looks_like_heading_text(s: str) -> bool:
+    """Lenient: short, few words, not a full sentence -- used to decide whether
+    an *emphasized* HTML block is a clause heading."""
+    s = s.strip().rstrip(".:;,")
+    return 2 <= len(s) <= 90 and len(s.split()) <= 10
 class _HTMLTextExtractor(html.parser.HTMLParser):
-    """Stdlib HTML -> text: drops script/style, frames block elements with blank
-    lines (so clause-heading detection still works), and unescapes entities."""
+    """Stdlib HTML -> text. Drops script/style, frames blocks with blank lines,
+    unescapes entities, and -- crucially for clause detection -- emits blocks
+    that are emphasized (a heading tag, or text wrapped in <b>/<strong>/<u>) as
+    Markdown `## headings`. Real contracts (e.g. SEC HTML exhibits) mark section
+    headings with emphasis, not `##`/numbers, so without this the cascade sees
+    only plain lines. A run-in heading (emphasized lead + body in one block) is
+    split into `## Title` + body."""
     _SKIP = {"script", "style", "head", "title", "meta", "link", "noscript"}
+    _EMPH = {"b", "strong", "u", "h1", "h2", "h3", "h4", "h5", "h6"}
     _BLOCK = {
         "p", "div", "br", "tr", "li", "h1", "h2", "h3", "h4", "h5", "h6",
         "section", "article", "table", "ul", "ol", "blockquote", "pre", "hr",
@@ -1064,32 +1095,94 @@ class _HTMLTextExtractor(html.parser.HTMLParser):
     def __init__(self) -> None:
         super().__init__(convert_charrefs=True)
-        self._parts: List[str] = []
+        self._lines: List[str] = []
+        self._runs: List[Tuple[bool, str]] = []  # (emphasized, text) for current block
         self._skip = 0
+        self._emph = 0
+        # Per-tag-name LIFO stack of "did this open tag add emphasis?", so an
+        # emphasis opened by a CSS style (not just a <b>/<u> tag) is closed by
+        # the right end tag even when many <font>/<span> nest.
+        self._emph_stack: Dict[str, List[bool]] = {}
+    @staticmethod
+    def _style_is_emph(attrs: Any) -> bool:
+        for name, value in attrs:
+            if name == "style" and value:
+                v = value.lower()
+                if ("font-weight:bold" in v.replace(" ", "") or "font-weight:700" in v.replace(" ", "")
+                        or "text-decoration:underline" in v.replace(" ", "")):
+                    return True
+        return False
+    def _flush_block(self) -> None:
+        runs, self._runs = self._runs, []
+        full = re.sub(r"\s+", " ", "".join(t for _e, t in runs)).strip()
+        if not full:
+            self._lines.append("")
+            return
+        # Standalone emphasized block (a heading tag or fully <b>/<u>/styled text).
+        if all(e for e, t in runs if t.strip()) and _looks_like_heading_text(_strip_clause_number(full)):
+            self._lines.append("## " + _strip_clause_number(full))
+            return
+        # Run-in heading: an optional leading numbering token ("(g)", "1.") then
+        # an emphasized title, then the body in the same block.
+        i, saw_emph = 0, False
+        while i < len(runs):
+            emph, txt = runs[i]
+            if not txt.strip():
+                i += 1
+            elif emph:
+                saw_emph = True
+                i += 1
+            elif not saw_emph and re.fullmatch(r"\(?[0-9A-Za-z]{1,4}\)?[.)]?", txt.strip()):
+                i += 1  # skip a clause-number/letter prefix
+            else:
+                break
+        lead = _strip_clause_number(re.sub(r"\s+", " ", "".join(t for _e, t in runs[:i])).strip())
+        rest = re.sub(r"\s+", " ", "".join(t for _e, t in runs[i:])).strip()
+        if saw_emph and lead and rest and _looks_like_heading_text(lead):
+            self._lines.append("## " + lead)
+            self._lines.append(rest)
+        else:
+            self._lines.append(full)
     def handle_starttag(self, tag: str, attrs: Any) -> None:
         if tag in self._SKIP:
             self._skip += 1
-        elif tag in self._BLOCK:
-            self._parts.append("\n")
+            return
+        if tag in self._BLOCK:
+            self._flush_block()
+        added = tag in self._EMPH or self._style_is_emph(attrs)
+        self._emph_stack.setdefault(tag, []).append(added)
+        if added:
+            self._emph += 1
     def handle_endtag(self, tag: str) -> None:
         if tag in self._SKIP and self._skip > 0:
             self._skip -= 1
-        elif tag in self._BLOCK:
-            self._parts.append("\n")
+            return
+        stack = self._emph_stack.get(tag)
+        if stack:
+            if stack.pop() and self._emph > 0:
+                self._emph -= 1
+        if tag in self._BLOCK:
+            self._flush_block()
     def handle_data(self, data: str) -> None:
         if self._skip == 0:
-            self._parts.append(data)
+            self._runs.append((self._emph > 0, data))
     def get_text(self) -> str:
-        # Strip each line; collapse runs of blank lines to a single blank line
-        # (gives ALL-CAPS / numbered headings their blank-line frame).
-        lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in "".join(self._parts).split("\n")]
+        self._flush_block()
+        # A lone emphasized heading is almost always the document title, not a
+        # section scheme -- downgrade it to plain text so the numbered/ALL-CAPS
+        # tiers can still detect the real sections (matches the >=2 threshold the
+        # other fallback tiers use).
+        if sum(1 for ln in self._lines if ln.startswith("## ")) < 2:
+            self._lines = [ln[3:] if ln.startswith("## ") else ln for ln in self._lines]
         out: List[str] = []
         blank = False
-        for ln in lines:
+        for ln in self._lines:
             if ln:
                 out.append(ln)
                 blank = False
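Reviewer note: the core idea of the rewritten `_HTMLTextExtractor` is to record each text run together with an "emphasized" flag and decide at block boundaries whether the block is a heading. The sketch below is a deliberately reduced version of that idea, not the class from the diff: it handles only `<b>`/`<strong>`/`<u>` inside `<p>`/`<div>`, and it omits CSS-style emphasis, run-in heading splitting, clause-number stripping, and the lone-heading downgrade. The class name `EmphasisHeadingSketch` and the helper `to_lines` are hypothetical.

```python
import html.parser
import re
from typing import Any, List, Tuple


class EmphasisHeadingSketch(html.parser.HTMLParser):
    """Emit a block as '## heading' when all of its visible text is emphasized."""

    EMPH = {"b", "strong", "u"}
    BLOCK = {"p", "div"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.lines: List[str] = []
        self.runs: List[Tuple[bool, str]] = []  # (emphasized, text) for the open block
        self.depth = 0  # number of currently open emphasis tags

    def flush(self) -> None:
        runs, self.runs = self.runs, []
        text = re.sub(r"\s+", " ", "".join(t for _, t in runs)).strip()
        if not text:
            return
        visible = [emph for emph, t in runs if t.strip()]
        # Short and fully emphasized: treat as a heading (10-word cap as in the diff).
        is_heading = all(visible) and len(text.split()) <= 10
        self.lines.append(("## " + text) if is_heading else text)

    def handle_starttag(self, tag: str, attrs: Any) -> None:
        if tag in self.BLOCK:
            self.flush()
        if tag in self.EMPH:
            self.depth += 1

    def handle_endtag(self, tag: str) -> None:
        if tag in self.EMPH and self.depth > 0:
            self.depth -= 1
        if tag in self.BLOCK:
            self.flush()

    def handle_data(self, data: str) -> None:
        self.runs.append((self.depth > 0, data))


def to_lines(markup: str) -> List[str]:
    parser = EmphasisHeadingSketch()
    parser.feed(markup)
    parser.close()
    parser.flush()  # emit any trailing block that had no closing tag
    return parser.lines


doc = (
    "<p><b>Governing Law</b></p>"
    "<p>This Agreement is governed by the laws of Delaware.</p>"
    "<p>The <b>Company</b> shall pay the fee.</p>"
)
print(to_lines(doc))
```

A block with mixed emphasis, like the third paragraph, stays plain text here. In the real class that case is where the run-in heading logic decides whether an emphasized lead should be split off as its own `## ` line.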

{extract_cli-0.1.12 → extract_cli-0.1.14}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "extract-cli"
-version = "0.1.12"
+version = "0.1.14"
 description = "Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON."
 readme = "README.md"
 requires-python = ">=3.9"

extract_cli-0.1.14/tests/eval/ATTRIBUTION.md ADDED

@@ -0,0 +1,20 @@
+# Benchmark corpus — sources & licensing
+The accuracy benchmark (`tests/eval/`) scores extract-cli against a small set of
+**real, executed contracts** filed publicly with the U.S. Securities and
+Exchange Commission (SEC EDGAR). SEC filings are public records; these exhibits
+are reproduced here, unmodified, solely as a regression/accuracy test fixture.
+| File | Source (SEC EDGAR) |
+|---|---|
+| `emp_celsci.txt` | CEL-SCI Corporation — Exhibit 10(ooo), employment agreement |
+| `msa_kpmg.txt` | Blade Internet Ventures / KPMG Consulting — master services agreement |
+| `services_visteon.txt` | Visteon Corporation — salaried employee lease agreement |
+| `consulting_mtm.htm` | MTM Technologies — consulting agreement |
+| `emp_arcp.htm` | American Realty Capital Properties — employment agreement |
+| `emp_quadgraphics.htm` | Quad/Graphics, Inc. — employment agreement |
+Ground truth (`gold.json`) was hand-verified against each document's text — the
+parties, effective date, governing law, normalized jurisdiction, and a
+verified subset of section headings. It is intentionally independent of what the
+extractor currently produces.
