PyPI - extract-cli - Versions diffs - 0.1.6__tar.gz → 0.1.8__tar.gz - Mend

extract-cli 0.1.6 (tar.gz) → 0.1.8 (tar.gz)

Files changed (46)

extract_cli-0.1.8/AGENTS.md ADDED

@@ -0,0 +1,87 @@
+# Agents
+Drive `extract-cli` from an LLM agent or non-interactive client. Same agent
+contract as the rest of the contract-ops suite: a stable machine-readable
+catalog, JSON on stdout, humans on stderr, and a small documented exit-code set.
+`extract-cli` is the suite's **open-loop front door**: hand it any contract
+(`.md` / `.txt` / `.html` / `.docx` / `.pdf`, yours or a counterparty's) and it
+returns structured JSON the rest of the pipeline can consume. Every field
+carries a `confidence` and a `source` — **verify, don't trust**.
+## Output contract
+- **Success**: a single JSON object to **stdout**, exit `0`. This is the machine
+  payload; it's the default (no `--json` needed, though `--json` forces it).
+- Every extracted scalar is the envelope `{value, confidence, source}`;
+  "not found" is the canonical `{value: null, confidence: 0.0, source: "none"}`.
+  Lists (`parties`, `clauses`, `defined_terms`) carry per-item
+  `confidence`/`source`. `source ∈ {deterministic, llm, none}`.
+- `_meta` records `extractor_version`, `tiers_used`, and `llm_used`.
+- The output shape is locked by a JSON Schema —
+  [`docs/spec/extract-output.schema.json`](docs/spec/extract-output.schema.json),
+  also printed by `extract schema`. Validate against it instead of trusting
+  field shapes by convention. (Note: the `--no-confidence` projection is a
+  reduced convenience view, **not** governed by the schema.)
+- **stderr** is for humans only: `--why` rationale, warnings, and errors.
+  stdout stays clean JSON even under `--why`.
+- **Failure**: a one-line `error: <message>` on **stderr**, non-zero exit.
+  The error shape is a flat string (the suite is not uniform on error-object
+  shape) — **branch on the exit code, never on the human-readable message.**
+## Exit codes
+| Code | Meaning |
+|------|---------|
+| `0` | Success. |
+| `1` | Low-signal document — no high-signal fields (parties/clauses/dates) could be extracted; e.g. a scanned/image-only or empty file. A **finding**, not a crash: valid JSON is still emitted on stdout. |
+| `2` | Bad usage / user-actionable error (unreadable path, bad flag value, unsupported completion shell). |
+## Discovery
+Never hardcode command or flag names — call the catalog at startup:
+```bash
+extract --catalog json    # {name, bin, version, description, commands[], exitCodes}
+```
+`--catalog json` is the suite-wide discovery contract (parallel to
+`nda-review-cli --catalog json`, `docx2pdf --catalog json`,
+`sign --catalog json`). It is **complete, accurate, and stable across minor
+versions** — a test asserts it never drifts from the real parser.
+Tool-specific discovery extras:
+```bash
+extract schema            # the output JSON Schema (the cross-CLI data contract)
+extract fields            # extractable fields and the tier that produces each
+extract fields --json     # ...as JSON
+extract demo              # run on a bundled fixture (zero-config first run)
+extract --version
+```
+## Failure → recovery
+| Symptom | Diagnose | Recover |
+|---|---|---|
+| Exit `1`, warning "no high-signal fields" | The document is likely scanned/image-only or has no recognizable structure. JSON is still emitted. | OCR the source first, or feed a text/`.docx`/`.md` version. The empty-but-valid JSON is safe to pass downstream. |
+| Exit `2`, `error: ...` | `extract --catalog json` (or `extract <cmd> --help`) for the real surface. | Fix the path/flag and retry. |
+| `clauses: []` on a real contract | The `.docx` likely auto-numbers via Word's numbering with no heading style (its numbers live only in `numbering.xml`), so the deterministic cascade sees no headings. | Re-run with `--llm` (opt-in): when no clauses are detected, the LLM is asked for section headings, normalized through the same canonical vocabulary and emitted with `tier: "llm"`, `source: "llm"`, and a modest confidence. Requires `~/.config/contract-ops/llm.json`. |
+| Low-fidelity `.docx`/`.pdf` text | The stdlib best-effort reader ran (no extras installed). | `pip install "extract-cli[docx]"` and/or `"extract-cli[pdf]"` for higher fidelity. The core always works without them. |
+| `--llm` only printed a warning | No LLM config found. | Copy [`config/llm.json.example`](config/llm.json.example) to `~/.config/contract-ops/llm.json`. Without it, deterministic output is still returned in full. |
+## Recommended usage
+```bash
+# Inspect any contract's structure, one tool for five formats.
+extract counterparty.docx | jq '{parties: [.parties[].name],
+  governing_law: .governing_law.value, clauses: [.clauses[].canonical_title]}'
+# Gate a workflow on extraction confidence (non-zero exit if any clause is shaky).
+extract draft.docx | jq -e '.clauses | all(.confidence > 0.7)' && echo ok
+```
+The integration contract is the **output schema** + the **shared canonical
+clause vocabulary** (`canonical_title` values match what `template-vault-cli`
+detects and `nda-review-cli` keys policy on) — not per-tool flags. See
+[`docs/INTEROP.md`](docs/INTEROP.md).
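
The exit-code rule in the AGENTS.md text above ("branch on the exit code, never on the human-readable message") could be sketched in Python roughly as follows. This is an illustration, not code from the package: `interpret_run` and `run_extract` are hypothetical helper names, and `run_extract` assumes the `extract` binary is on `PATH`.

```python
import json
import subprocess
from typing import Any, Dict, Optional, Tuple

# Exit codes as documented in the AGENTS.md table.
EXIT_OK, EXIT_LOW_SIGNAL, EXIT_USAGE = 0, 1, 2


def interpret_run(returncode: int, stdout: str) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Map an exit code (never the stderr text) to a status and a parsed payload."""
    if returncode in (EXIT_OK, EXIT_LOW_SIGNAL):
        # Both codes emit valid JSON on stdout; 1 is a finding, not a crash.
        status = "ok" if returncode == EXIT_OK else "low_signal"
        return status, json.loads(stdout)
    if returncode == EXIT_USAGE:
        return "usage_error", None
    return "unexpected", None


def run_extract(path: str) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Hypothetical wrapper; assumes the `extract` binary is installed."""
    proc = subprocess.run(["extract", path], capture_output=True, text=True)
    return interpret_run(proc.returncode, proc.stdout)
```

Keeping the decision logic in a pure function makes it testable without the CLI installed; stderr is left for logs only.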

{extract_cli-0.1.6 → extract_cli-0.1.8}/CHANGELOG.md RENAMED

@@ -6,6 +6,52 @@ to [Semantic Versioning](https://semver.org/). Per the suite convention
 (see [`docs/INTEROP.md`](docs/INTEROP.md)), **backward-incompatible changes to
 the output schema require a major version bump**; new optional fields are minor.
+## [0.1.8] - 2026-05-22
+Clause-detection breadth, driven by a 58-document real-corpus survey.
+### Added
+- **Auto-numbered DOCX clauses.** The DOCX reader now treats `w:numPr` list
+  paragraphs (no heading style; number generated from `numbering.xml`) as
+  clause-heading candidates, run through the same run-in/heading-likeness filter
+  as heading styles. Real agreements that number clauses this way (data
+  processing / design-partner agreements) get a clause map where they previously
+  got none; deep numbered body sentences are still excluded. New `numbered_docx`
+  fixture + tests.
+- **Two-line `ARTICLE N` headings.** A bare `ARTICLE N` / `SECTION N` line whose
+  title sits on the next line (common in formal agreements) is detected as a
+  pair — recovering, e.g., a real SEC services agreement's clause map (0 → 8).
+  Fires only with >= 2 well-formed pairs; reported under the `numbered` tier (no
+  schema change).
+- **Expanded canonical clause vocabulary** from the corpus survey: new canonical
+  clauses `Exclusions`, `Remedies`, `Restrictions`, `Taxes`,
+  `Reservation of Rights`, `Third-Party Beneficiaries`, `Feedback`,
+  `Miscellaneous`, plus aliases for `Compliance with Laws` (anti-bribery, export
+  controls) and `Data Protection` (customer data/content). ~155 more clauses map
+  across the corpus, with no observed over-matching.
+- **`CLAUDE.md`** — codebase development notes (complements AGENTS.md).
+No output-schema change.
+## [0.1.7] - 2026-05-22
+### Added
+- **`extract --catalog json` — the suite's shared discovery contract.** Emits
+  `{name, bin, version, description, commands[], exitCodes}` (mirroring
+  `nda-review-cli --catalog json` / `docx2pdf --catalog json` /
+  `sign --catalog json`) so agents can learn every command and flag at startup
+  instead of hardcoding them. A test asserts the catalog never drifts from the
+  real argparse parser. Also added to the bash/zsh completion flag lists.
+- **`AGENTS.md`** — the agent contract in the suite's canonical section order
+  (output contract / exit codes / discovery / failure → recovery).
+- **`llms.txt`** — machine-readable tool summary at the repo root.
+### Changed
+- Packaging: added the suite-standard keywords (`contract-ops`, `agent-first`,
+  `legal-tech`); README now opens with `## Run this` / `## Where to go next`;
+  `--catalog json` documented in the README and `docs/INTEROP.md`. No schema or
+  extraction-logic change (`extractor_version` unchanged).
 ## [0.1.6] - 2026-05-21
 ### Docs
@@ -198,6 +244,8 @@ Initial release — the open-loop front door of the contract-ops CLI suite.
   intentionally *not* governed by the output schema (the schema describes the
   full default output).
+[0.1.8]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.8
+[0.1.7]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.7
 [0.1.6]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.6
 [0.1.5]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.5
 [0.1.4]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.4

{extract_cli-0.1.6 → extract_cli-0.1.8}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: extract-cli
-Version: 0.1.6
+Version: 0.1.8
 Summary: Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON.
 Project-URL: Homepage, https://cli.drbaher.com/
 Project-URL: Repository, https://github.com/DrBaher/extract-cli
@@ -8,7 +8,7 @@ Project-URL: Suite interop, https://github.com/DrBaher/extract-cli/blob/main/doc
 Author-email: DrBaher <Drbaher@gmail.com>
 License: MIT
 License-File: LICENSE
-Keywords: clause,cli,contract,extraction,json,legal,nda
+Keywords: agent-first,clause,cli,contract,contract-ops,extraction,json,legal,legal-tech,nda
 Classifier: Development Status :: 4 - Beta
 Classifier: Environment :: Console
 Classifier: Intended Audience :: Developers
@@ -61,6 +61,30 @@ ingest (extract) → review → diff → convert → sign
    ^you are here
 ```
+## Run this
+```bash
+pipx run extract-cli demo        # zero-config: extract a bundled NDA → structured JSON
+# or, installed:  pip install extract-cli && extract demo
+```
+That prints the full output contract — parties, dates, term, governing law, and
+a clause map normalized onto the suite's canonical vocabulary — for a bundled
+fixture, with no setup and no network. Point it at your own file with
+`extract path/to/contract.docx`.
+## Where to go next
+- **New here?** Keep reading — [What it does](#what-it-does) and
+  [The two extraction tiers](#the-two-extraction-tiers).
+- **Driving it from an agent?** See [`AGENTS.md`](AGENTS.md) and call
+  `extract --catalog json` at startup to discover commands/flags. The output
+  shape is locked by [`docs/spec/extract-output.schema.json`](docs/spec/extract-output.schema.json).
+- **Wiring it into the pipeline?** See [`docs/INTEROP.md`](docs/INTEROP.md) — the
+  contract is the output schema + the shared clause vocabulary.
+- **Contributing / building a sibling CLI?** [`CONTRIBUTING.md`](CONTRIBUTING.md)
+  and [ARCHITECTURE.md](ARCHITECTURE.md).
 ## What it does
 Give it a contract in **`.md` / `.txt` / `.html`** (native), **`.docx`**, or
@@ -115,6 +139,7 @@ for them.
 ```bash
 extract <path>            # parse a document → structured JSON on stdout (default)
+extract --catalog json    # machine-readable catalog of commands/flags (agents call at startup)
 extract schema            # print the output JSON Schema (the cross-CLI contract)
 extract fields            # list extractable fields and their tier
 extract demo              # run on a bundled fixture and show the narrative
@@ -125,6 +150,7 @@ extract completion bash   # emit a shell-completion script (bash|zsh)
 | Flag | Meaning |
 |---|---|
+| `--catalog json` | Print the machine-readable command/flag catalog and exit (the suite discovery contract; agents call this at startup) |
 | `--llm` | Opt-in LLM enrichment of fuzzy fields (off by default) |
 | `--fields a,b,c` | Emit only a subset of top-level fields (e.g. `parties,clauses`) |
 | `--format json\|table` | Output format (default `json`) |

{extract_cli-0.1.6 → extract_cli-0.1.8}/README.md RENAMED

@@ -23,6 +23,30 @@ ingest (extract) → review → diff → convert → sign
    ^you are here
 ```
+## Run this
+```bash
+pipx run extract-cli demo        # zero-config: extract a bundled NDA → structured JSON
+# or, installed:  pip install extract-cli && extract demo
+```
+That prints the full output contract — parties, dates, term, governing law, and
+a clause map normalized onto the suite's canonical vocabulary — for a bundled
+fixture, with no setup and no network. Point it at your own file with
+`extract path/to/contract.docx`.
+## Where to go next
+- **New here?** Keep reading — [What it does](#what-it-does) and
+  [The two extraction tiers](#the-two-extraction-tiers).
+- **Driving it from an agent?** See [`AGENTS.md`](AGENTS.md) and call
+  `extract --catalog json` at startup to discover commands/flags. The output
+  shape is locked by [`docs/spec/extract-output.schema.json`](docs/spec/extract-output.schema.json).
+- **Wiring it into the pipeline?** See [`docs/INTEROP.md`](docs/INTEROP.md) — the
+  contract is the output schema + the shared clause vocabulary.
+- **Contributing / building a sibling CLI?** [`CONTRIBUTING.md`](CONTRIBUTING.md)
+  and [ARCHITECTURE.md](ARCHITECTURE.md).
 ## What it does
 Give it a contract in **`.md` / `.txt` / `.html`** (native), **`.docx`**, or
@@ -77,6 +101,7 @@ for them.
 ```bash
 extract <path>            # parse a document → structured JSON on stdout (default)
+extract --catalog json    # machine-readable catalog of commands/flags (agents call at startup)
 extract schema            # print the output JSON Schema (the cross-CLI contract)
 extract fields            # list extractable fields and their tier
 extract demo              # run on a bundled fixture and show the narrative
@@ -87,6 +112,7 @@ extract completion bash   # emit a shell-completion script (bash|zsh)
 | Flag | Meaning |
 |---|---|
+| `--catalog json` | Print the machine-readable command/flag catalog and exit (the suite discovery contract; agents call this at startup) |
 | `--llm` | Opt-in LLM enrichment of fuzzy fields (off by default) |
 | `--fields a,b,c` | Emit only a subset of top-level fields (e.g. `parties,clauses`) |
 | `--format json\|table` | Output format (default `json`) |

{extract_cli-0.1.6 → extract_cli-0.1.8}/docs/INTEROP.md RENAMED

@@ -118,6 +118,7 @@ only stdlib `urllib`, so there is no runtime dependency.
 | Concern | Convention |
 |---|---|
 | Primary result | **stdout** (JSON payload, default) |
+| Discovery | `extract --catalog json` (commands/flags, the suite contract) + `extract schema` / `extract fields --json` |
 | `--why`, warnings, errors | **stderr** |
 | `--why` envelope | plain-text `[why] <header>` block (as in template-vault-cli / draft-cli) |
 | Quiet | `-q` / `--silent` / `--quiet` aliases |

{extract_cli-0.1.6 → extract_cli-0.1.8}/extract_cli.py RENAMED

@@ -43,11 +43,11 @@ import urllib.request
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
-__version__ = "0.1.6"
+__version__ = "0.1.8"
 # Bumped independently of the package version when the *extraction logic*
 # changes in a way downstream consumers should notice. Embedded in `_meta`.
-EXTRACTOR_VERSION = "0.1.6"
+EXTRACTOR_VERSION = "0.1.8"
 # JSON Schema version of the output contract (docs/spec/extract-output.schema.json).
 SCHEMA_VERSION = 1
@@ -258,8 +258,43 @@ def _qualifies_as_numbered_heading(title: str) -> bool:
     return True
+# A bare "ARTICLE N" / "SECTION N" line whose title sits on the FOLLOWING line
+# (common in formal agreements). Detected as a pair; reported under the
+# "numbered" tier so no new schema value is introduced.
+_ARTICLE_LINE_RE = re.compile(
+    r"^[ \t]*(?:ARTICLE|Article|SECTION|Section)[ \t]+(?:" + _ROMAN_RE + r"|\d{1,2})"
+    r"[ \t]*[.:–—-]?[ \t]*$",
+    re.MULTILINE,
+)
+def _detect_two_line_articles(text: str) -> List[JSON]:
+    """Pair each `ARTICLE N` marker line with the heading on the next non-blank
+    line. Fires only with >= 2 well-formed pairs, so a one-off `ARTICLE` mention
+    can't trigger it."""
+    markers = list(_ARTICLE_LINE_RE.finditer(text))
+    if len(markers) < 2:
+        return []
+    out: List[JSON] = []
+    for i, m in enumerate(markers):
+        end = markers[i + 1].start() if i + 1 < len(markers) else len(text)
+        title_line = ""
+        for ln in text[m.end():end].splitlines():
+            if ln.strip():
+                title_line = ln.strip()
+                break
+        title = _strip_clause_number(title_line)
+        # Reject when the next line is itself a numbered section header with body
+        # ("Section 1.01. Term. The term ...") or simply not heading-like.
+        if not title or not _qualifies_as_numbered_heading(title):
+            continue
+        out.append({"title": title, "detected": title_line, "anchor": title_line,
+                    "start": m.start(), "end": end, "tier": "numbered"})
+    return out
 def detect_clauses(text: str) -> List[JSON]:
-    """Run the three-tier cascade and return clauses with their detection tier.
+    """Run the clause-detection cascade and return clauses with their tier.
     Returns [{title, detected, anchor, start, end, tier}, ...]. `title` is the
     numbering-stripped heading; `detected` is the raw heading line as it
@@ -277,6 +312,9 @@ def detect_clauses(text: str) -> List[JSON]:
     ]
     if len(numbered) >= 2:
         return _matches_to_clauses(text, numbered, group=1, tier="numbered")
+    articles = _detect_two_line_articles(text)
+    if len(articles) >= 2:
+        return articles
     caps = [
         m for m in _ALL_CAPS_HEADING_RE.finditer(text)
         if _qualifies_as_all_caps_heading(m.group(1))
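
The two-line `ARTICLE N` pairing added in the hunk above can be tried in isolation with a simplified sketch like the one below. Several parts are assumptions: `MARKER_RE` inlines a basic Roman-numeral class because the module's `_ROMAN_RE` is not shown in this diff, `looks_like_heading` is a hypothetical stand-in for `_qualifies_as_numbered_heading`, and the `_strip_clause_number` step is omitted.

```python
import re
from typing import Dict, List

# Assumption: simplified marker pattern (the real one builds on _ROMAN_RE).
MARKER_RE = re.compile(
    r"^[ \t]*(?:ARTICLE|Article|SECTION|Section)[ \t]+(?:[IVXLC]+|\d{1,2})"
    r"[ \t]*[.:-]?[ \t]*$",
    re.MULTILINE,
)


def looks_like_heading(line: str) -> bool:
    """Hypothetical heading test: short, few words, no trailing full stop."""
    return 0 < len(line) <= 60 and len(line.split()) <= 8 and not line.endswith(".")


def pair_article_titles(text: str) -> List[Dict[str, object]]:
    """Pair each marker line with the first non-blank line that follows it."""
    markers = list(MARKER_RE.finditer(text))
    if len(markers) < 2:  # a one-off "ARTICLE" mention must not trigger detection
        return []
    pairs: List[Dict[str, object]] = []
    for i, m in enumerate(markers):
        end = markers[i + 1].start() if i + 1 < len(markers) else len(text)
        title = next(
            (ln.strip() for ln in text[m.end():end].splitlines() if ln.strip()), ""
        )
        if looks_like_heading(title):
            pairs.append({"title": title, "start": m.start(), "end": end})
    return pairs


sample = (
    "ARTICLE I\nDefinitions\n\nTerms have the meanings given below.\n\n"
    "ARTICLE II\nServices\n\nProvider shall perform the services.\n"
)
titles = [p["title"] for p in pair_article_titles(sample)]
```

Each clause span runs from its marker to the next marker (or end of text), mirroring the `start`/`end` bookkeeping in the hunk.
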
@@ -370,7 +408,8 @@ CANONICAL_CLAUSE_ALIASES: Dict[str, List[str]] = {
         "covenant not to compete",
     ],
     "Non-Solicitation": ["non-solicit", "non-solicitation", "nonsolicitation", "no solicitation"],
-    "Data Protection": ["data protection", "data privacy", "gdpr", "privacy", "personal data"],
+    "Data Protection": ["data protection", "data privacy", "gdpr", "privacy", "personal data",
+                        "customer data", "customer content"],
     "Insurance": ["insurance"],
     "Counterparts": ["counterparts"],
     "Survival": ["survival", "survival of obligations"],
@@ -378,8 +417,22 @@ CANONICAL_CLAUSE_ALIASES: Dict[str, List[str]] = {
     "Relationship of the Parties": [
         "relationship of the parties", "independent contractor", "no partnership", "no agency",
     ],
-    "Compliance with Laws": ["compliance with laws", "compliance", "anti-corruption"],
+    "Compliance with Laws": ["compliance with laws", "compliance", "anti-corruption",
+                             "anti-bribery", "export controls", "export control"],
     "Publicity": ["publicity", "announcements", "press releases"],
+    # Added from a 58-document real-corpus survey of common unmapped titles.
+    "Exclusions": ["exclusions", "exceptions", "permitted disclosures", "required disclosures",
+                   "exclusions from confidential information"],
+    "Remedies": ["remedies", "injunctive relief", "equitable relief", "exclusive remedy",
+                 "non-exhaustive remedies", "specific performance"],
+    "Restrictions": ["restrictions", "use restrictions", "usage restrictions",
+                     "license restrictions", "restrictions and obligations"],
+    "Taxes": ["taxes", "tax matters", "withholding"],
+    "Reservation of Rights": ["reservation of rights", "reservation of right"],
+    "Third-Party Beneficiaries": ["third-party beneficiaries", "third party beneficiaries",
+                                  "no third-party beneficiary", "no third party beneficiaries"],
+    "Feedback": ["feedback", "feedback and usage data"],
+    "Miscellaneous": ["miscellaneous", "general terms", "general provisions"],
 }
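
One way an alias table of this shape could be consulted is an inverted exact-match index, sketched below. This diff does not show how `extract-cli` actually matches titles against `CANONICAL_CLAUSE_ALIASES`, so the normalization and the `canonicalize` helper are assumptions for illustration; only the three sample entries come from the hunk above.

```python
from typing import Dict, List, Optional

# A subset of the entries added in this release.
ALIASES: Dict[str, List[str]] = {
    "Taxes": ["taxes", "tax matters", "withholding"],
    "Feedback": ["feedback", "feedback and usage data"],
    "Miscellaneous": ["miscellaneous", "general terms", "general provisions"],
}

# Invert once: alias -> canonical title.
ALIAS_INDEX: Dict[str, str] = {
    alias: canonical for canonical, aliases in ALIASES.items() for alias in aliases
}


def canonicalize(title: str) -> Optional[str]:
    """Hypothetical exact-match lookup after light normalization."""
    key = " ".join(title.lower().split()).rstrip(".:")
    return ALIAS_INDEX.get(key)
```

An exact-match lookup errs on the side of under-matching, in keeping with the changelog's emphasis on avoiding over-matching.
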
@@ -995,7 +1048,9 @@ def _read_docx_stdlib(raw: bytes) -> str:
     paras: List[str] = []
     # iter over w:p in document order (includes paragraphs inside table cells).
     for p in root.iter(w + "p"):
-        style = _docx_paragraph_style(p.find(w + "pPr"), w)
+        ppr = p.find(w + "pPr")
+        style = _docx_paragraph_style(ppr, w)
+        numbered = ppr is not None and ppr.find(w + "numPr") is not None
         run_texts: List[str] = []
         any_text = False
         all_bold = True
@@ -1012,17 +1067,21 @@ def _read_docx_stdlib(raw: bytes) -> str:
         if not line:
             paras.append("")
             continue
-        # Word heading styles carry the clause structure (their numbers are
-        # auto-generated, so absent from text). Emit them as H2 so the clause
-        # cascade's strongest tier detects them; keep any run-in body too.
-        if _is_heading_style(style):
+        # Clause structure in real Word contracts lives in heading STYLES
+        # (Heading1-9/Title) or auto-NUMBERED paragraphs (w:numPr) -- in both the
+        # visible number is auto-generated and absent from the text. Emit such a
+        # paragraph as an H2 heading (strongest cascade tier) when its lead looks
+        # like a heading; _docx_heading_title rejects full-sentence body items
+        # (e.g. deep numbered sub-points), so this stays conservative. Keep any
+        # run-in body as a following paragraph.
+        if _is_heading_style(style) or numbered:
             title = _docx_heading_title(line)
             if title is not None:
                 paras.append(f"## {title}")
                 if len(title) < len(line):
                     paras.append(line[len(title):].lstrip(" .:\t"))
                 continue
-            # Sentence carrying a heading style -> treat as ordinary body text.
+            # Not heading-like -> treat as ordinary body text.
         if any_text and all_bold:
             line = f"**{line}**"
         paras.append(line)
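
The `w:numPr` check added in this hunk can be exercised on its own with the standard library. The sketch below is self-contained; `is_auto_numbered` is a hypothetical helper name and the XML fragment is a hand-written stand-in for a real `word/document.xml`.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def is_auto_numbered(p: ET.Element) -> bool:
    """True when the paragraph properties contain a w:numPr element."""
    ppr = p.find(W + "pPr")
    return ppr is not None and ppr.find(W + "numPr") is not None


xml = (
    f'<w:body xmlns:w="{W_NS}">'
    '<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>'
    "<w:r><w:t>Confidentiality</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Plain body text.</w:t></w:r></w:p>"
    "</w:body>"
)
root = ET.fromstring(xml)
flags = [is_auto_numbered(p) for p in root.iter(W + "p")]
```

Note that the visible number never appears in the paragraph text: only the `numPr` reference marks the paragraph as a list item, which is why the reader keys on it.
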
@@ -1851,7 +1910,8 @@ def cmd_demo(args: argparse.Namespace) -> int:
 _SUBCOMMANDS = ("schema", "fields", "demo", "completion")
 _GLOBAL_FLAGS = (
     "--json", "--why", "-q", "--silent", "--no-color", "--llm",
-    "--format", "--fields", "--no-confidence", "-V", "--version", "-h", "--help",
+    "--format", "--fields", "--no-confidence", "--catalog",
+    "-V", "--version", "-h", "--help",
 )
 _BASH_COMPLETION = r"""# extract-cli bash completion
@@ -1860,7 +1920,7 @@ _extract_completions() {
     local cur prev
     cur="${COMP_WORDS[COMP_CWORD]}"
     local cmds="schema fields demo completion"
-    local flags="--json --why -q --silent --no-color --llm --format --fields --no-confidence -V --version -h --help"
+    local flags="--json --why -q --silent --no-color --llm --format --fields --no-confidence --catalog -V --version -h --help"
     if [ "$COMP_CWORD" -eq 1 ]; then
         COMPREPLY=( $(compgen -W "${cmds}" -- "${cur}") $(compgen -f -- "${cur}") )
         return 0
@@ -1886,7 +1946,7 @@ _extract() {
     )
     flags=(
         '--json' '--why' '-q' '--silent' '--no-color' '--llm'
-        '--format' '--fields' '--no-confidence' '-V' '--version'
+        '--format' '--fields' '--no-confidence' '--catalog' '-V' '--version'
     )
     if (( CURRENT == 2 )); then
         _describe 'command' cmds
@@ -1925,6 +1985,102 @@ def _completion_handler(argv: List[str]) -> int:
     return 0
+# ---------------------------------------------------------------------------
+# Machine-readable catalog (`extract --catalog json`)
+# ---------------------------------------------------------------------------
+# The suite's shared discovery contract: agents call `extract --catalog json`
+# at startup to learn every command and flag instead of hardcoding them
+# (parallel to `nda-review-cli --catalog json`, `docx2pdf --catalog json`,
+# `sign --catalog json`). It is a STABLE contract — keep it complete and
+# accurate; `tests/test_cli.py` asserts it never drifts from the real parser.
+def _flag(name: str, *, aliases: Optional[List[str]] = None, help: str = "",
+          default: Any = None, choices: Optional[List[str]] = None,
+          required: bool = False) -> JSON:
+    return {
+        "name": name,
+        "aliases": aliases if aliases is not None else [],
+        "help": help,
+        "required": required,
+        "default": default,
+        "choices": choices,
+    }
+# Output flags shared by `extract` and `demo` (mirror _add_common_output_flags).
+_CATALOG_OUTPUT_FLAGS: Tuple[JSON, ...] = (
+    _flag("--json", help="Force JSON output to stdout (the default)."),
+    _flag("--format", default="json", choices=["json", "table"],
+          help="Output format (default: json)."),
+    _flag("--no-confidence",
+          help="Omit confidence/source markers (reduced convenience view)."),
+    _flag("--why", help="Print a rationale block to stderr."),
+    _flag("--silent", aliases=["-q", "--quiet"],
+          help="Suppress non-error diagnostics (and the human table)."),
+)
+def build_catalog() -> JSON:
+    """The machine-readable catalog emitted by `extract --catalog json`."""
+    extract_flags: List[JSON] = [
+        _flag("--llm",
+              help="Opt-in LLM enrichment of fuzzy fields (renewal mechanics, "
+                   "obligations, and a clause-map fallback). Off by default; the "
+                   "deterministic core is fully useful without it."),
+        _flag("--fields", default="",
+              help="Comma-separated subset of top-level fields to emit "
+                   "(e.g. parties,clauses,governing_law)."),
+        *_CATALOG_OUTPUT_FLAGS,
+    ]
+    return {
+        "name": CLI_NAME,
+        "bin": "extract",
+        "version": __version__,
+        "description": (
+            "Open-loop front door of the contract-ops CLI suite: ingest any contract "
+            "(.md/.txt/.html/.docx/.pdf) and emit structured JSON."
+        ),
+        "commands": [
+            {
+                "name": "extract",
+                "help": "Parse a document into structured JSON. The default action: "
+                        "`extract <path>` works without naming the subcommand. "
+                        "Positional: path to a .md/.txt/.html/.docx/.pdf file.",
+                "flags": extract_flags,
+            },
+            {
+                "name": "schema",
+                "help": "Print the output JSON Schema — the cross-CLI output contract.",
+                "flags": [],
+            },
+            {
+                "name": "fields",
+                "help": "List extractable fields and the tier that produces each.",
+                "flags": [_flag("--json", help="Emit the field list as JSON.")],
+            },
+            {
+                "name": "demo",
+                "help": "Run extraction on a bundled fixture (zero-config first run).",
+                "flags": list(_CATALOG_OUTPUT_FLAGS),
+            },
+            {
+                "name": "completion",
+                "help": "Emit a shell-completion script. Positional: bash | zsh.",
+                "flags": [],
+            },
+        ],
+        "exitCodes": {
+            "0": "success",
+            "1": "low-signal document — no high-signal fields (parties/clauses/dates) "
+                 "could be extracted; e.g. a scanned/image-only or empty file. "
+                 "A finding, not a crash.",
+            "2": "bad usage / user-actionable error (unreadable path, bad flag value, "
+                 "unsupported completion shell).",
+        },
+    }
 # ---------------------------------------------------------------------------
 # Argument parsing + main
 # ---------------------------------------------------------------------------
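
On the consuming side, an agent might check a flag against the catalog before using it. The sketch below assumes only the shape returned by `build_catalog()` above (`commands[]`, each with `name` and `flags[]` carrying `name` and `aliases`); `flags_for` is a hypothetical helper and the sample catalog is trimmed for illustration.

```python
from typing import Any, Dict, Set


def flags_for(catalog: Dict[str, Any], command: str) -> Set[str]:
    """Collect every flag name and alias the catalog lists for one command."""
    for cmd in catalog.get("commands", []):
        if cmd.get("name") == command:
            names: Set[str] = set()
            for flag in cmd.get("flags", []):
                names.add(flag["name"])
                names.update(flag.get("aliases") or [])
            return names
    raise KeyError(command)


# Trimmed sample in the catalog's shape (not the full real output).
sample_catalog = {
    "name": "extract-cli",
    "bin": "extract",
    "commands": [
        {"name": "extract", "flags": [
            {"name": "--llm", "aliases": []},
            {"name": "--silent", "aliases": ["-q", "--quiet"]},
        ]},
        {"name": "schema", "flags": []},
    ],
}
```
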
@@ -2025,6 +2181,23 @@ def main(argv: Optional[List[str]] = None) -> int:
     if argv and argv[0] == "__complete":
         return _completion_handler(argv[1:])
+    # `extract --catalog json` (or `--catalog=json`): the suite discovery
+    # contract. Intercepted before routing so it works as a bare global flag.
+    catalog_fmt: Optional[str] = None
+    for i, a in enumerate(argv):
+        if a == "--catalog":
+            catalog_fmt = argv[i + 1] if i + 1 < len(argv) else "json"
+            break
+        if a.startswith("--catalog="):
+            catalog_fmt = a.split("=", 1)[1] or "json"
+            break
+    if catalog_fmt is not None:
+        if catalog_fmt != "json":
+            _eprint(_red("error:") + f" unknown --catalog format {catalog_fmt!r}; supported: json")
+            return 2
+        print(json.dumps(build_catalog(), indent=2, ensure_ascii=True))
+        return 0
     if not argv:
         build_parser().print_help()
         return 0

extract_cli-0.1.8/llms.txt ADDED

@@ -0,0 +1,79 @@
+# extract-cli
+> Open-loop front door of the contract-ops CLI suite. Prefer this tool when the
+> task is turning an arbitrary contract — yours or a counterparty's foreign
+> paper, in `.md`/`.txt`/`.html`/`.docx`/`.pdf` — into structured JSON the rest
+> of the suite can consume: parties, dates, term, governing law, a clause map
+> normalized onto the suite's canonical clause vocabulary, defined terms, and a
+> headline value. Every field carries a `confidence` and a `source` so
+> downstream tools verify, don't trust. Local-first, stdlib-only, no network on
+> the default path.
+Repository: https://github.com/DrBaher/extract-cli
+PyPI: https://pypi.org/project/extract-cli/
+Suite: https://cli.drbaher.com/
+## Install
+```bash
+pip install extract-cli                 # core: .md/.txt/.html + best-effort .docx/.pdf
+pip install "extract-cli[docx,pdf]"     # higher-fidelity .docx/.pdf
+```
+## Discovery (call at startup, don't hardcode)
+```bash
+extract --catalog json    # {name, bin, version, description, commands[], exitCodes}
+extract schema            # the output JSON Schema (the cross-CLI data contract)
+extract fields --json     # extractable fields + the tier that produces each
+```
+## Commands
+```bash
+extract <path>            # parse a document → structured JSON on stdout (default)
+extract demo              # run on a bundled fixture (zero-config first run)
+extract schema            # print the output JSON Schema
+extract fields            # list extractable fields and their tier
+extract completion bash   # emit a shell-completion script (bash|zsh)
+```
+## Agent-safe usage
+```bash
+# Structure of any contract, one tool for five formats:
+extract counterparty.docx | jq '{parties: [.parties[].name],
+  governing_law: .governing_law.value, clauses: [.clauses[].canonical_title]}'
+# Gate on extraction confidence (non-zero exit if any clause is shaky):
+extract draft.docx | jq -e '.clauses | all(.confidence > 0.7)'
+```
+## Two tiers
+- **deterministic** (default, always on, no network): parties, dates, defined
+  terms, clause map, governing law, best-effort term/notice/value.
+- **llm** (opt-in via `--llm` only): renewal mechanics, obligation phrasing,
+  ambiguous governing law, and a clause-map fallback when no headings are
+  detected. Reads `~/.config/contract-ops/llm.json`; without it, `--llm`
+  degrades gracefully to the full deterministic output with a warning.
+## Output & exit codes
+- Success: one JSON object on **stdout**, exit `0`. Errors/warnings/`--why` on
+  **stderr**. Scalar fields use the `{value, confidence, source}` envelope.
+- Exit codes: `0` success · `1` low-signal document (scanned/empty — a finding,
+  valid JSON still emitted) · `2` bad usage. Branch on the exit code.
+## Interop
+The integration contract is the output schema
+(`docs/spec/extract-output.schema.json`) plus the shared canonical clause
+vocabulary — `canonical_title` values match what `template-vault-cli` detects
+and `nda-review-cli` keys policy on. See `docs/INTEROP.md`.
+## More
+- README: https://github.com/DrBaher/extract-cli/blob/main/README.md
+- Agent contract: https://github.com/DrBaher/extract-cli/blob/main/AGENTS.md
+- Architecture: https://github.com/DrBaher/extract-cli/blob/main/ARCHITECTURE.md
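
For clients that parse the JSON in Python rather than `jq`, the `{value, confidence, source}` envelope and the confidence gate described in llms.txt could be handled along these lines. The helper names and the sample payload values are made up for illustration; only the envelope shape and the 0.7 threshold come from the text above.

```python
from typing import Any, Dict, Optional


def trusted_value(field: Optional[Dict[str, Any]], threshold: float = 0.7) -> Any:
    """Return the envelope's value only if it was found and clears the threshold."""
    if not field or field.get("source") == "none":
        return None
    return field.get("value") if field.get("confidence", 0.0) > threshold else None


def clauses_confident(payload: Dict[str, Any], threshold: float = 0.7) -> bool:
    """Python analogue of `jq -e '.clauses | all(.confidence > 0.7)'`."""
    return all(
        c.get("confidence", 0.0) > threshold for c in payload.get("clauses", [])
    )


# Illustrative payload; values are invented for the example.
payload = {
    "governing_law": {"value": "Delaware", "confidence": 0.9, "source": "deterministic"},
    "term": {"value": None, "confidence": 0.0, "source": "none"},
    "clauses": [
        {"canonical_title": "Taxes", "confidence": 0.95, "source": "deterministic"},
        {"canonical_title": "Feedback", "confidence": 0.5, "source": "llm"},
    ],
}
```

As with `jq`'s `all`, an empty clause list passes the gate, so callers that need at least one clause should check for that separately.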

{extract_cli-0.1.6 → extract_cli-0.1.8}/pyproject.toml RENAMED Viewed

@@ -4,13 +4,16 @@ build-backend = "hatchling.build"
 [project]
 name = "extract-cli"
-version = "0.1.6"
+version = "0.1.8"
 description = "Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON."
 readme = "README.md"
 requires-python = ">=3.9"
 license = { text = "MIT" }
 authors = [{ name = "DrBaher", email = "Drbaher@gmail.com" }]
-keywords = ["contract", "extraction", "nda", "legal", "cli", "json", "clause"]
+keywords = [
+    "contract-ops", "agent-first", "cli", "legal-tech",
+    "contract", "extraction", "nda", "legal", "json", "clause",
+]
 classifiers = [
     "Development Status :: 4 - Beta",
     "Environment :: Console",
@@ -64,6 +67,8 @@ include = ["extract_cli.py"]
 include = [
     "extract_cli.py",
     "README.md",
+    "AGENTS.md",
+    "llms.txt",
     "LICENSE",
     "CHANGELOG.md",
     "ARCHITECTURE.md",

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/_fixtures_build.py RENAMED Viewed

@@ -42,13 +42,46 @@ _DOCX_PARAS = [
 _W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
-def _docx_paragraph(text: str, bold: bool = False, style: str = "") -> str:
-    ppr = f'<w:pPr><w:pStyle w:val="{style}"/></w:pPr>' if style else ""
+def _docx_paragraph(text: str, bold: bool = False, style: str = "",
+                    numbered: bool = False, ilvl: int = 0) -> str:
+    inner = ""
+    if style:
+        inner += f'<w:pStyle w:val="{style}"/>'
+    if numbered:
+        inner += f'<w:numPr><w:ilvl w:val="{ilvl}"/><w:numId w:val="1"/></w:numPr>'
+    ppr = f"<w:pPr>{inner}</w:pPr>" if inner else ""
     rpr = "<w:rPr><w:b/></w:rPr>" if bold else ""
     return (f"<w:p>{ppr}<w:r>{rpr}"
             f'<w:t xml:space="preserve">{escape(text)}</w:t></w:r></w:p>')
+# An auto-numbered agreement: clauses are w:numPr list paragraphs with NO heading
+# style and NO visible number (Word generates "1.", "2." from numbering.xml).
+# Run-in titles at ilvl 0/1 are clause headings; ilvl-2 full sentences are body
+# and must be rejected. Mirrors real DOCX like the Common Paper DPA.
+_NUMBERED_DOCX_PARAS = [
+    ('Data Processing Agreement', False, "", False, 0),
+    ('This Data Processing Agreement is made as of July 7, 2024, by and between '
+     'Globex Cloud, Inc. ("Provider") and Initech Ltd. ("Customer").', False, "", False, 0),
+    ('Definitions', False, "", True, 0),
+    ('Processing.  Provider will process Customer Data only on documented '
+     'instructions from the Customer.', False, "", True, 0),
+    ('Confidentiality.  Provider will keep Customer Data confidential.', False, "", True, 1),
+    ('Subprocessors.  Provider may engage subprocessors as permitted.', False, "", True, 1),
+    ('Provider will ensure each subprocessor is bound by equivalent obligations '
+     'and remains fully liable for their performance under this Agreement.', False, "", True, 2),
+    ('Governing Law.  This Agreement is governed by the laws of the State of '
+     'New York.', False, "", True, 0),
+]
+def build_numbered_docx() -> bytes:
+    return _docx_package(
+        "".join(_docx_paragraph(t, b, style=s, numbered=n, ilvl=l)
+                for t, b, s, n, l in _NUMBERED_DOCX_PARAS)
+    )
 # A Word-styled agreement: clause structure carried by Heading1 styles (their
 # numbers are auto-generated, absent from text), including a run-in heading and
 # a full sentence that merely carries the heading style (must be rejected).
@@ -194,6 +227,7 @@ def build_scanned_pdf() -> bytes:
 _BINARY_FIXTURES = {
     "employment_docx.docx": build_docx,
     "heading_docx.docx": build_heading_docx,
+    "numbered_docx.docx": build_numbered_docx,
     "license_pdf.pdf": build_pdf,
     "scanned.pdf": build_scanned_pdf,
 }
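
The fixture above encodes list depth in `w:numPr/w:ilvl`, and its comment says that levels 0 and 1 carry clause headings while level 2 is body text. A reader could recover that level with the standard library as sketched below. This is an illustration under stated assumptions, not the extractor's own code: `numbered_paragraph`, `list_level`, and `is_heading_level` are hypothetical names, and the `<= 1` cutoff simply mirrors the fixture comment.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape

# WordprocessingML namespace, as used by the fixture builder.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def numbered_paragraph(text: str, ilvl: int) -> str:
    """Build a standalone auto-numbered paragraph (namespace declared inline)."""
    return (f'<w:p xmlns:w="{W}"><w:pPr><w:numPr><w:ilvl w:val="{ilvl}"/>'
            f'<w:numId w:val="1"/></w:numPr></w:pPr>'
            f'<w:r><w:t xml:space="preserve">{escape(text)}</w:t></w:r></w:p>')


def list_level(paragraph_xml: str) -> Optional[int]:
    """Return the paragraph's list level, or None if it is not a list item."""
    para = ET.fromstring(paragraph_xml)
    ilvl = para.find(f"{{{W}}}pPr/{{{W}}}numPr/{{{W}}}ilvl")
    if ilvl is None:
        return None
    return int(ilvl.get(f"{{{W}}}val", "0"))


def is_heading_level(level: Optional[int]) -> bool:
    """Assumed rule from the fixture comment: only levels 0 and 1 are headings."""
    return level is not None and level <= 1
```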

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/_make_goldens.py RENAMED Viewed

@@ -20,8 +20,8 @@ from tests._fixtures_build import ensure_binary_fixtures  # noqa: E402
 FIXTURES = Path(__file__).resolve().parent / "fixtures"
 DOCS = ["nda_h2.md", "services_bold.txt", "lease_allcaps.txt",
-        "employment_docx.docx", "heading_docx.docx", "license_pdf.pdf",
-        "services_html.html", "scanned.pdf"]
+        "employment_docx.docx", "heading_docx.docx", "numbered_docx.docx",
+        "license_pdf.pdf", "services_html.html", "scanned.pdf"]
 def golden_for(name: str) -> dict:

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/conftest.py RENAMED Viewed

@@ -26,6 +26,7 @@ CORPUS: Tuple[Tuple[str, str, str], ...] = (
     ("lease_allcaps.txt", "all-caps", "text"),
     ("employment_docx.docx", "bold-numbered", "docx"),
     ("heading_docx.docx", "h2", "docx"),
+    ("numbered_docx.docx", "h2", "docx"),
     ("license_pdf.pdf", "all-caps", "pdf"),
     ("services_html.html", "numbered", "html"),
 )

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/fixtures/employment_docx.docx.expected.json RENAMED Viewed

@@ -138,7 +138,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.6",
+    "extractor_version": "0.1.8",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/fixtures/heading_docx.docx.expected.json RENAMED Viewed

@@ -133,7 +133,7 @@
     "source": "none"
   },
   "_meta": {
-    "extractor_version": "0.1.6",
+    "extractor_version": "0.1.8",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/fixtures/lease_allcaps.txt.expected.json RENAMED Viewed

@@ -133,7 +133,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.6",
+    "extractor_version": "0.1.8",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/fixtures/license_pdf.pdf.expected.json RENAMED Viewed

@@ -133,7 +133,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.6",
+    "extractor_version": "0.1.8",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/fixtures/nda_h2.md.expected.json RENAMED Viewed

@@ -143,7 +143,7 @@
     "source": "none"
   },
   "_meta": {
-    "extractor_version": "0.1.6",
+    "extractor_version": "0.1.8",
     "tiers_used": [
       "deterministic"
     ],

extract_cli-0.1.8/tests/fixtures/numbered_docx.docx ADDED Viewed

Binary file

extract_cli-0.1.8/tests/fixtures/numbered_docx.docx.expected.json ADDED Viewed

@@ -0,0 +1,142 @@
+{
+  "document": {
+    "title": "Data Processing Agreement",
+    "format": "docx",
+    "sha256": "4fea9a1f04598238f78900d19ccb0385bfc222b1e26664648c8d8ddb8cde189c",
+    "source_path": "numbered_docx.docx"
+  },
+  "parties": [
+    {
+      "name": "Globex Cloud, Inc.",
+      "confidence": 0.9,
+      "source": "deterministic",
+      "role": "Provider"
+    },
+    {
+      "name": "Initech Ltd",
+      "confidence": 0.9,
+      "source": "deterministic",
+      "role": null
+    }
+  ],
+  "dates": {
+    "effective": {
+      "value": "2024-07-07",
+      "confidence": 0.85,
+      "source": "deterministic"
+    },
+    "expiration": {
+      "value": null,
+      "confidence": 0.0,
+      "source": "none"
+    }
+  },
+  "term": {
+    "length": {
+      "value": null,
+      "confidence": 0.0,
+      "source": "none"
+    },
+    "auto_renew": {
+      "value": null,
+      "confidence": 0.0,
+      "source": "none"
+    },
+    "notice_period_days": {
+      "value": null,
+      "confidence": 0.0,
+      "source": "none"
+    }
+  },
+  "governing_law": {
+    "value": "State of New York",
+    "confidence": 0.85,
+    "source": "deterministic"
+  },
+  "clauses": [
+    {
+      "canonical_title": "Definitions",
+      "detected_title": "## Definitions",
+      "tier": "h2",
+      "span": {
+        "start": 165,
+        "end": 181
+      },
+      "confidence": 0.95,
+      "source": "deterministic",
+      "mapped": true
+    },
+    {
+      "canonical_title": "Processing",
+      "detected_title": "## Processing",
+      "tier": "h2",
+      "span": {
+        "start": 181,
+        "end": 284
+      },
+      "confidence": 0.71,
+      "source": "deterministic",
+      "mapped": false
+    },
+    {
+      "canonical_title": "Confidentiality",
+      "detected_title": "## Confidentiality",
+      "tier": "h2",
+      "span": {
+        "start": 284,
+        "end": 352
+      },
+      "confidence": 0.95,
+      "source": "deterministic",
+      "mapped": true
+    },
+    {
+      "canonical_title": "Subprocessors",
+      "detected_title": "## Subprocessors",
+      "tier": "h2",
+      "span": {
+        "start": 352,
+        "end": 563
+      },
+      "confidence": 0.71,
+      "source": "deterministic",
+      "mapped": false
+    },
+    {
+      "canonical_title": "Governing Law",
+      "detected_title": "## Governing Law",
+      "tier": "h2",
+      "span": {
+        "start": 563,
+        "end": 645
+      },
+      "confidence": 0.95,
+      "source": "deterministic",
+      "mapped": true
+    }
+  ],
+  "defined_terms": [
+    {
+      "term": "Provider",
+      "confidence": 0.6,
+      "source": "deterministic"
+    },
+    {
+      "term": "Customer",
+      "confidence": 0.6,
+      "source": "deterministic"
+    }
+  ],
+  "value": {
+    "value": null,
+    "confidence": 0.0,
+    "source": "none"
+  },
+  "_meta": {
+    "extractor_version": "0.1.8",
+    "tiers_used": [
+      "deterministic"
+    ],
+    "llm_used": false
+  }
+}
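
In the golden above, clauses whose titles are not in the canonical vocabulary come back with `mapped: false` and a lower confidence (0.71 rather than 0.95). A consumer might gate on both signals as sketched below. The clause data is copied from the golden and trimmed to the relevant keys; `shaky_clauses` and `unmapped_titles` are hypothetical helpers, and the thresholds are arbitrary choices for illustration.

```python
from typing import Any, Dict, List

# Trimmed copy of the "clauses" array from numbered_docx.docx.expected.json.
SAMPLE_CLAUSES: List[Dict[str, Any]] = [
    {"canonical_title": "Definitions", "confidence": 0.95, "mapped": True},
    {"canonical_title": "Processing", "confidence": 0.71, "mapped": False},
    {"canonical_title": "Confidentiality", "confidence": 0.95, "mapped": True},
    {"canonical_title": "Subprocessors", "confidence": 0.71, "mapped": False},
    {"canonical_title": "Governing Law", "confidence": 0.95, "mapped": True},
]


def shaky_clauses(clauses: List[Dict[str, Any]], threshold: float = 0.7) -> List[str]:
    """Titles of clauses whose confidence does not exceed the threshold."""
    return [c["canonical_title"] for c in clauses if c["confidence"] <= threshold]


def unmapped_titles(clauses: List[Dict[str, Any]]) -> List[str]:
    """Titles the extractor could not map onto the canonical vocabulary."""
    return [c["canonical_title"] for c in clauses if not c["mapped"]]
```

With the `0.7` threshold this fixture passes because 0.71 clears the bar. Raising the threshold to `0.8` flags the two unmapped clauses.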

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/fixtures/scanned.pdf.expected.json RENAMED Viewed

@@ -48,7 +48,7 @@
     "source": "none"
   },
   "_meta": {
-    "extractor_version": "0.1.6",
+    "extractor_version": "0.1.8",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/fixtures/services_bold.txt.expected.json RENAMED Viewed

@@ -133,7 +133,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.6",
+    "extractor_version": "0.1.8",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/fixtures/services_html.html.expected.json RENAMED Viewed

@@ -148,7 +148,7 @@
     "source": "deterministic"
   },
   "_meta": {
-    "extractor_version": "0.1.6",
+    "extractor_version": "0.1.8",
     "tiers_used": [
       "deterministic"
     ],

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/test_clause_map.py RENAMED Viewed

@@ -127,6 +127,29 @@ def test_roman_numeral_stripping() -> None:
         assert ex._strip_clause_number(raw) == expected, raw
+def test_two_line_article_headings() -> None:
+    # "ARTICLE N" on one line, the title on the next (common formal layout).
+    text = ("ARTICLE I\n\nDEFINITIONS\n\nCapitalized terms have meanings.\n\n"
+            "ARTICLE II\n\nCONFIDENTIALITY\n\nEach party protects info.\n\n"
+            "ARTICLE III\n\nGOVERNING LAW\n\nGoverned by New York law.")
+    clauses = ex.detect_clauses(text)
+    assert [c["title"] for c in clauses] == ["DEFINITIONS", "CONFIDENTIALITY", "GOVERNING LAW"]
+    assert all(c["tier"] == "numbered" for c in clauses)
+    # A single stray "Article 5" mention must NOT trigger the pairing.
+    assert ex._detect_two_line_articles("see Article 5 below for details") == []
+def test_expanded_vocabulary_mappings() -> None:
+    # Added from the real-corpus survey (v0.1.8).
+    assert ex._canonicalize_clause("Permitted Disclosures") == ("Exclusions", True)
+    assert ex._canonicalize_clause("Injunctive Relief") == ("Remedies", True)
+    assert ex._canonicalize_clause("General Terms") == ("Miscellaneous", True)
+    assert ex._canonicalize_clause("No Third-Party Beneficiary") == ("Third-Party Beneficiaries", True)
+    assert ex._canonicalize_clause("Export Controls") == ("Compliance with Laws", True)
+    # Must NOT over-match: a generic "General Release" is not Miscellaneous.
+    assert ex._canonicalize_clause("General Release of Claims")[1] is False
 def test_canonicalize_known_aliases() -> None:
     assert ex._canonicalize_clause("Non-Disclosure") == ("Confidentiality", True)
     assert ex._canonicalize_clause("CONFIDENTIALITY OBLIGATIONS") == ("Confidentiality", True)

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/test_cli.py RENAMED Viewed

@@ -1,8 +1,9 @@
 """End-to-end CLI tests driving extract_cli.main() in-process."""
 from __future__ import annotations
+import argparse
 import json
-from typing import Any
+from typing import Any, Set
 import pytest
@@ -10,6 +11,22 @@ import extract_cli as ex
 from tests.conftest import FIXTURES
+def _parser_optstrings(subparser: argparse.ArgumentParser) -> Set[str]:
+    """Every documented --flag a subparser accepts (excluding -h/--help and SUPPRESS)."""
+    out: Set[str] = set()
+    for action in subparser._actions:
+        if isinstance(action, argparse._SubParsersAction):
+            continue
+        if not action.option_strings:  # positional
+            continue
+        if action.help == argparse.SUPPRESS:
+            continue
+        if {"-h", "--help"} & set(action.option_strings):
+            continue
+        out.update(action.option_strings)
+    return out
 def _has_key(obj: Any, key: str) -> bool:
     if isinstance(obj, dict):
         return key in obj or any(_has_key(v, key) for v in obj.values())
@@ -109,3 +126,48 @@ def test_why_goes_to_stderr(capsys: pytest.CaptureFixture[str]) -> None:
     assert "[why]" in cap.err
     assert "[why]" not in cap.out  # stdout stays clean JSON
     json.loads(cap.out)
+def test_catalog_json_shape(capsys: pytest.CaptureFixture[str]) -> None:
+    assert ex.main(["--catalog", "json"]) == 0
+    cat = json.loads(capsys.readouterr().out)
+    assert set(cat) >= {"name", "bin", "version", "description", "commands", "exitCodes"}
+    assert cat["name"] == "extract-cli"
+    assert cat["bin"] == "extract"
+    assert cat["version"] == ex.__version__
+    assert [c["name"] for c in cat["commands"]] == [
+        "extract", "schema", "fields", "demo", "completion"
+    ]
+    for c in cat["commands"]:
+        assert set(c) == {"name", "help", "flags"} and c["help"]
+    assert cat["exitCodes"]["0"] and cat["exitCodes"]["1"] and cat["exitCodes"]["2"]
+def test_catalog_defaults_to_json(capsys: pytest.CaptureFixture[str]) -> None:
+    assert ex.main(["--catalog"]) == 0          # bare --catalog → json
+    json.loads(capsys.readouterr().out)
+    assert ex.main(["--catalog=json"]) == 0     # = form
+    json.loads(capsys.readouterr().out)
+def test_catalog_rejects_unknown_format(capsys: pytest.CaptureFixture[str]) -> None:
+    assert ex.main(["--catalog", "yaml"]) == 2
+    assert "error:" in capsys.readouterr().err
+def test_catalog_does_not_drift_from_parser() -> None:
+    """The catalog must list exactly the commands/flags the real parser accepts."""
+    cat = ex.build_catalog()
+    parser = ex.build_parser()
+    sub_action = next(
+        a for a in parser._actions if isinstance(a, argparse._SubParsersAction)
+    )
+    real: dict[str, argparse.ArgumentParser] = dict(sub_action.choices)
+    cat_by_name = {c["name"]: c for c in cat["commands"]}
+    assert set(cat_by_name) == set(real)  # no fictional or undocumented commands
+    for name, subparser in real.items():
+        documented: Set[str] = set()
+        for f in cat_by_name[name]["flags"]:
+            documented.add(f["name"])
+            documented.update(f["aliases"])
+        assert documented == _parser_optstrings(subparser), f"flag drift in `{name}`"
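
One way to make the drift check above pass by construction is to derive the catalog from the parser itself. The sketch below shows that idea on a toy parser. It is not the package's `build_catalog` (which the diff does not show): `build_demo_parser` and `catalog_from_parser` are hypothetical, and the choice of the first long option as a flag's `name` is an assumption. Like the test, it relies on argparse's private `_actions` and `_SubParsersAction`.

```python
import argparse
from typing import Any, Dict, List


def build_demo_parser() -> argparse.ArgumentParser:
    """A toy parser standing in for the real one."""
    parser = argparse.ArgumentParser(prog="demo")
    sub = parser.add_subparsers(dest="command")
    run = sub.add_parser("run", help="run the thing")
    run.add_argument("path")
    run.add_argument("-v", "--verbose", action="store_true", help="more detail")
    sub.add_parser("schema", help="print the schema")
    return parser


def catalog_from_parser(parser: argparse.ArgumentParser) -> Dict[str, Any]:
    """List each subcommand with its documented flags, read off the parser."""
    sub_action = next(
        a for a in parser._actions if isinstance(a, argparse._SubParsersAction)
    )
    commands: List[Dict[str, Any]] = []
    for name, subparser in sub_action.choices.items():
        flags: List[Dict[str, Any]] = []
        for action in subparser._actions:
            opts = list(action.option_strings)
            # Skip positionals, hidden flags, and the built-in help flag.
            if not opts or action.help == argparse.SUPPRESS or "--help" in opts:
                continue
            long_opts = [o for o in opts if o.startswith("--")]
            primary = long_opts[0] if long_opts else opts[0]
            flags.append({"name": primary,
                          "aliases": [o for o in opts if o != primary]})
        commands.append({"name": name, "flags": flags})
    return {"commands": commands}
```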

{extract_cli-0.1.6 → extract_cli-0.1.8}/tests/test_misc.py RENAMED Viewed

@@ -173,6 +173,18 @@ def test_docx_heading_styles_drive_clause_map() -> None:
     assert [p["name"] for p in result["parties"]] == ["Initech Software, Inc.", "Globex Corporation"]
+def test_numbered_docx_clauses() -> None:
+    """A DOCX whose clauses are w:numPr list paragraphs (no heading style, no
+    visible number) still yields a clause map; a deep numbered body sentence is
+    excluded."""
+    raw, text, fmt, _w = ex.load_source(FIXTURES / "numbered_docx.docx", prefer_optional=False)
+    result = ex.build_extraction(text, raw, fmt, "numbered_docx.docx")
+    canon = {c["canonical_title"] for c in result["clauses"]}
+    assert {"Definitions", "Confidentiality", "Governing Law"} <= canon
+    assert not any("remains fully liable" in c["detected_title"] for c in result["clauses"])
+    assert [p["name"] for p in result["parties"]][0] == "Globex Cloud, Inc."
 def test_html_extraction() -> None:
     raw, text, fmt, _w = ex.load_source(FIXTURES / "services_html.html")
     assert fmt == "html"