extract-cli 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. extract_cli-0.1.0/.gitignore +25 -0
  2. extract_cli-0.1.0/ARCHITECTURE.md +95 -0
  3. extract_cli-0.1.0/CHANGELOG.md +60 -0
  4. extract_cli-0.1.0/CONTRIBUTING.md +74 -0
  5. extract_cli-0.1.0/LICENSE +21 -0
  6. extract_cli-0.1.0/Makefile +84 -0
  7. extract_cli-0.1.0/PKG-INFO +225 -0
  8. extract_cli-0.1.0/README.md +187 -0
  9. extract_cli-0.1.0/config/llm.json.example +7 -0
  10. extract_cli-0.1.0/docs/INTEROP.md +139 -0
  11. extract_cli-0.1.0/docs/spec/extract-output.schema.json +298 -0
  12. extract_cli-0.1.0/extract_cli.py +1710 -0
  13. extract_cli-0.1.0/pyproject.toml +87 -0
  14. extract_cli-0.1.0/scripts/release.py +75 -0
  15. extract_cli-0.1.0/scripts/validate_against_spec.py +98 -0
  16. extract_cli-0.1.0/tests/_fixtures_build.py +181 -0
  17. extract_cli-0.1.0/tests/_make_goldens.py +44 -0
  18. extract_cli-0.1.0/tests/_schema_validator.py +97 -0
  19. extract_cli-0.1.0/tests/conftest.py +43 -0
  20. extract_cli-0.1.0/tests/fixtures/employment_docx.docx +0 -0
  21. extract_cli-0.1.0/tests/fixtures/employment_docx.docx.expected.json +147 -0
  22. extract_cli-0.1.0/tests/fixtures/lease_allcaps.txt +25 -0
  23. extract_cli-0.1.0/tests/fixtures/lease_allcaps.txt.expected.json +142 -0
  24. extract_cli-0.1.0/tests/fixtures/license_pdf.pdf +90 -0
  25. extract_cli-0.1.0/tests/fixtures/license_pdf.pdf.expected.json +142 -0
  26. extract_cli-0.1.0/tests/fixtures/nda_h2.md +31 -0
  27. extract_cli-0.1.0/tests/fixtures/nda_h2.md.expected.json +147 -0
  28. extract_cli-0.1.0/tests/fixtures/scanned.pdf +34 -0
  29. extract_cli-0.1.0/tests/fixtures/scanned.pdf.expected.json +57 -0
  30. extract_cli-0.1.0/tests/fixtures/services_bold.txt +27 -0
  31. extract_cli-0.1.0/tests/fixtures/services_bold.txt.expected.json +142 -0
  32. extract_cli-0.1.0/tests/test_clause_map.py +91 -0
  33. extract_cli-0.1.0/tests/test_cli.py +111 -0
  34. extract_cli-0.1.0/tests/test_deterministic.py +118 -0
  35. extract_cli-0.1.0/tests/test_llm.py +110 -0
  36. extract_cli-0.1.0/tests/test_misc.py +160 -0
  37. extract_cli-0.1.0/tests/test_property.py +106 -0
  38. extract_cli-0.1.0/tests/test_schema_conformance.py +64 -0
@@ -0,0 +1,25 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.egg-info/
5
+ .eggs/
6
+
7
+ # Build artifacts
8
+ build/
9
+ dist/
10
+
11
+ # Test / type / coverage caches
12
+ .pytest_cache/
13
+ .mypy_cache/
14
+ .coverage
15
+ htmlcov/
16
+
17
+ # Virtualenvs
18
+ .venv/
19
+ venv/
20
+
21
+ # Local LLM config (never commit credentials)
22
+ config/llm.json
23
+
24
+ # OS
25
+ .DS_Store
@@ -0,0 +1,95 @@
1
+ # Architecture
2
+
3
+ `extract-cli` is one file, `extract_cli.py`, stdlib-only. This document is the
4
+ map.
5
+
6
+ ## Pipeline
7
+
8
+ ```
9
+ load_source(path) extension/content sniff → reader
10
+ ├─ .md/.txt → utf-8 decode
11
+ ├─ .docx → python-docx (if [docx]) else stdlib zipfile/XML reader
12
+ └─ .pdf → pypdf (if [pdf]) else stdlib zlib + text-operator reader
13
+
14
+ ▼ (raw_bytes, text, format, warnings)
15
+ build_extraction(text, raw, fmt, src) the DETERMINISTIC tier (always on)
16
+ ├─ extract_parties "between X and Y", with role parentheticals
17
+ ├─ extract_dates effective / expiration, ISO-normalized
18
+ ├─ extract_term length / auto_renew / notice_period_days
19
+ ├─ extract_governing_law "governed by the laws of …"
20
+ ├─ extract_clauses detect_clauses cascade → canonical mapping
21
+ ├─ extract_defined_terms quoted / parenthetical Capitalized terms
22
+ └─ extract_value headline monetary amount
23
+
24
+ ▼ result : dict (the output contract)
25
+ llm_enrich(result, text, args) the LLM tier — only if --llm
26
+
27
+
28
+ render_json | render_table stdout (JSON is the machine payload)
29
+ ```
30
+
31
+ Each extracted scalar is wrapped by `_field(value, confidence, source)` into a
32
+ `{value, confidence, source}` envelope; "not found" is the canonical
33
+ `{value: null, confidence: 0.0, source: "none"}`. Lists (`parties`, `clauses`,
34
+ `defined_terms`) carry per-item `confidence`/`source`. `_meta` records the
35
+ extractor version, the tiers that ran, and whether the LLM was used. This is
36
+ the "verify, not trust" contract downstream tools consume.
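The envelope described above can be sketched as follows. This is only an illustration under stated assumptions: `make_field` and `NOT_FOUND` are hypothetical names standing in for `_field`, whose body is not shown in this diff, and the clamping of `confidence` and the check on `source` are assumptions rather than documented behavior.

```python
from typing import Any, Dict

# Canonical "not found" envelope, as described in ARCHITECTURE.md.
NOT_FOUND: Dict[str, Any] = {"value": None, "confidence": 0.0, "source": "none"}


def make_field(value: Any, confidence: float, source: str) -> Dict[str, Any]:
    """Wrap a scalar in a {value, confidence, source} envelope (hypothetical helper)."""
    if value is None:
        # A missing value always collapses to the canonical envelope.
        return dict(NOT_FOUND)
    if source not in ("deterministic", "llm"):
        raise ValueError("unexpected source: %r" % (source,))
    # Assumption: confidence is clamped to the range [0.0, 1.0].
    clamped = max(0.0, min(1.0, float(confidence)))
    return {"value": value, "confidence": clamped, "source": source}


effective = make_field("2024-03-01", 0.85, "deterministic")
expiration = make_field(None, 0.85, "deterministic")
```

Collapsing every missing value to one canonical shape means a consumer can test `field["source"] == "none"` without also inspecting `value`.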
37
+
38
+ ## The clause map
39
+
40
+ `detect_clauses(text)` is a faithful port of template-vault-cli's three-tier
41
+ cascade; the first tier that fires wins so fallbacks never shadow real
42
+ structure:
43
+
44
+ 1. **`h2`** — `## Heading` (Markdown-native). Needs ≥ 1 match.
45
+ 2. **`bold-numbered`** — `**1. Purpose**`, `**Section 4. Term**` (typical of
46
+ DOCX → text). Needs ≥ 2 matches.
47
+ 3. **`all-caps`** — blank-line-framed `CONFIDENTIALITY` lines (typical of legal
48
+ PDFs), with the single-token-≥-4-letters rule. Needs ≥ 2 matches.
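A first-tier-wins cascade like the one listed above might look roughly like this. It is a sketch, not a copy of `detect_clauses`: the regular expressions, the helper names, and the reading of the single-token rule (a one-word heading needs at least four letters) are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple

# Illustrative patterns only; the shipped ones are not shown in this diff.
H2_RE = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)
BOLD_RE = re.compile(r"^\*\*(?:Section[ \t]+)?\d+\.[ \t]+(.+?)\*\*[ \t]*$", re.MULTILINE)


def find_h2(text: str) -> List[str]:
    return H2_RE.findall(text)


def find_bold_numbered(text: str) -> List[str]:
    return BOLD_RE.findall(text)


def find_all_caps(text: str) -> List[str]:
    lines = text.splitlines()
    found = []
    for i, line in enumerate(lines):
        s = line.strip()
        if not s or s != s.upper() or not any(c.isalpha() for c in s):
            continue
        # The heading must be framed by blank lines (or the document edges).
        framed = (i == 0 or not lines[i - 1].strip()) and (
            i == len(lines) - 1 or not lines[i + 1].strip()
        )
        # Assumed reading of the single-token rule: one word needs >= 4 letters.
        too_short = len(s.split()) == 1 and sum(c.isalpha() for c in s) < 4
        if framed and not too_short:
            found.append(s)
    return found


# (tier name, finder, minimum number of matches for the tier to fire)
TIERS: List[Tuple[str, Callable[[str], List[str]], int]] = [
    ("h2", find_h2, 1),
    ("bold-numbered", find_bold_numbered, 2),
    ("all-caps", find_all_caps, 2),
]


def detect_clauses_sketch(text: str) -> Tuple[str, List[str]]:
    """Return (tier, titles) from the first tier that reaches its minimum."""
    for name, finder, minimum in TIERS:
        titles = finder(text)
        if len(titles) >= minimum:
            return name, titles
    return "none", []
```

Because the loop returns on the first tier that fires, a Markdown document with a single `##` heading never falls through to the bold-numbered or ALL-CAPS fallbacks.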
49
+
50
+ `_strip_clause_number` removes leading numbering, including Roman numerals
51
+ 1–39 (`_ROMAN_RE` lists longer alternatives first so the engine doesn't
52
+ short-circuit on a prefix; bare `V` and `X` still match).
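One way to build a longest-first Roman-numeral alternation for 1 to 39 is sketched below. The names `ROMAN_RE`, `NUM_RE`, and `strip_clause_number` are hypothetical stand-ins, and the required `.` or `)` delimiter is an assumption; the shipped `_ROMAN_RE` may differ. With a required delimiter the regex engine would backtrack to the longer numeral anyway, so the ordering here is defensive and mirrors the description above.

```python
import re
from typing import List


def to_roman(n: int) -> str:
    """Convert 1..39 to a Roman numeral (enough for clause numbering)."""
    out = []
    for value, symbol in ((10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")):
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


# Longest alternatives first, so "VIII" is tried before "V".
ROMANS: List[str] = sorted((to_roman(n) for n in range(1, 40)), key=len, reverse=True)
ROMAN_RE = re.compile(r"^(?:%s)[.)][ \t]+" % "|".join(ROMANS))
# Arabic numbering such as "1. ", "4.2 ", or "Section 4. " (assumed shapes).
NUM_RE = re.compile(r"^(?:(?:Section|Article)[ \t]+)?\d+(?:\.\d+)*[.)]?[ \t]+", re.IGNORECASE)


def strip_clause_number(title: str) -> str:
    """Remove one leading Roman or Arabic clause number, if present."""
    stripped = title.strip()
    without_roman = ROMAN_RE.sub("", stripped, count=1)
    if without_roman != stripped:
        return without_roman
    return NUM_RE.sub("", stripped, count=1)
```

Requiring a delimiter after the numeral keeps ordinary words that start with `V` or `X` from being mistaken for numbering.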
53
+
54
+ `_canonicalize_clause` then maps each detected title onto the suite's shared
55
+ vocabulary via `CANONICAL_CLAUSE_ALIASES` (canonical_title → [alias, …]):
56
+ exact normalized match first, then a containment fallback. Unmapped clauses are
57
+ kept with `mapped: false` and a lower confidence so nothing is silently
58
+ dropped. template-vault stores this map *per template*; a foreign document has
59
+ none, so extract-cli ships a built-in default — that's the differentiator.
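The two-pass lookup (exact normalized match, then containment) can be illustrated with a small sketch. The three-entry `ALIASES` table, the confidence numbers, and the choice to echo the detected title for unmapped clauses are assumptions for illustration; the shipped `CANONICAL_CLAUSE_ALIASES` is not reproduced in this diff.

```python
import re
from typing import Any, Dict, List

# Hypothetical excerpt in the documented shape: canonical_title -> [alias, ...].
ALIASES: Dict[str, List[str]] = {
    "Confidentiality": ["confidentiality", "non-disclosure", "confidential information"],
    "Governing Law": ["governing law", "choice of law", "applicable law"],
    "Term": ["term", "term and termination", "duration"],
}


def normalize(title: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def canonicalize(detected: str) -> Dict[str, Any]:
    norm = normalize(detected)
    # Pass 1: exact match on the normalized title.
    for canonical, aliases in ALIASES.items():
        if norm in {normalize(a) for a in aliases}:
            return {"canonical_title": canonical, "detected_title": detected,
                    "mapped": True, "confidence": 0.95}
    # Pass 2: containment on whole words, so "term" does not hit "determination".
    padded = " %s " % norm
    for canonical, aliases in ALIASES.items():
        if any(" %s " % normalize(a) in padded for a in aliases):
            return {"canonical_title": canonical, "detected_title": detected,
                    "mapped": True, "confidence": 0.8}
    # Unmapped clauses are kept, flagged, and given a lower confidence.
    return {"canonical_title": detected, "detected_title": detected,
            "mapped": False, "confidence": 0.4}
```

Running the exact pass over the whole table before any containment check keeps a precise alias from losing to a looser one that happens to appear earlier in the dictionary.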
60
+
61
+ ## Readers: degrade up, not down
62
+
63
+ The playbook gates heavy parsing behind `[docx]`/`[pdf]`. We honor the spirit
64
+ (extras improve fidelity) while keeping `.docx`/`.pdf` working with zero extras
65
+ via stdlib readers, because the hard rule is "fully functional with zero
66
+ extras + degrade gracefully" — and a best-effort reader serves that better than
67
+ refusing the format. See the decision note in
68
+ [CHANGELOG.md](CHANGELOG.md). A scanned/image-only PDF yields no text → a
69
+ stderr warning and exit code `1` (a "finding"), never a crash.
70
+
71
+ ## LLM tier
72
+
73
+ Opt-in only (`--llm`), never in a hot path. `load_llm_config()` reads the
74
+ suite-shared config (`~/.config/contract-ops/llm.json` then `./config/llm.json`).
75
+ `_llm_request` posts via stdlib `urllib` to Anthropic or an OpenAI-compatible
76
+ endpoint. Any failure (no config, network error, unparseable JSON) is caught:
77
+ a warning to stderr, deterministic output untouched. The LLM only *adds* fuzzy
78
+ fields (`term.renewal_mechanics`, `obligations`) and fills `governing_law` only
79
+ when the deterministic tier found nothing — it never overwrites a deterministic
80
+ value.
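The "add, never overwrite" rule can be sketched as a merge step. Everything here is illustrative: `merge_llm`, `enrich_safely`, the flat shape of the LLM reply, the per-item shape of `obligations`, and the default confidence of 0.5 are assumptions, not the package's actual `llm_enrich`.

```python
import copy
import sys
from typing import Any, Callable, Dict


def merge_llm(result: Dict[str, Any], llm: Dict[str, Any],
              confidence: float = 0.5) -> Dict[str, Any]:
    """Return a copy of `result` enriched with LLM fields; deterministic values win."""
    merged = copy.deepcopy(result)

    def envelope(value: Any) -> Dict[str, Any]:
        return {"value": value, "confidence": confidence, "source": "llm"}

    # governing_law is filled only when the deterministic tier found nothing.
    current = merged.get("governing_law") or {}
    if current.get("value") is None and llm.get("governing_law"):
        merged["governing_law"] = envelope(llm["governing_law"])

    # Purely additive fuzzy fields.
    if llm.get("renewal_mechanics"):
        merged.setdefault("term", {})["renewal_mechanics"] = envelope(llm["renewal_mechanics"])
    if llm.get("obligations"):
        merged["obligations"] = [
            {"text": text, "confidence": confidence, "source": "llm"}
            for text in llm["obligations"]
        ]
    return merged


def enrich_safely(result: Dict[str, Any],
                  call_llm: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Any failure leaves the deterministic result untouched and warns on stderr."""
    try:
        return merge_llm(result, call_llm())
    except Exception as exc:  # no config, network error, unparseable reply, ...
        print("warning: LLM enrichment skipped: %s" % exc, file=sys.stderr)
        return result
```

Working on a deep copy means a failure halfway through the merge cannot leave the deterministic result partially modified.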
81
+
82
+ ## The output contract
83
+
84
+ `output_schema()` is the single source of truth for the JSON Schema. `extract
85
+ schema` prints it; `docs/spec/extract-output.schema.json` is the committed copy
86
+ (`make spec-check` asserts they're identical). Tests validate every fixture's
87
+ output against it with a vendored, dependency-free validator
88
+ (`tests/_schema_validator.py`).
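To give a sense of what a dependency-free validator involves, here is a deliberately tiny sketch that covers only `type`, `enum`, `required`, `properties`, and `items`. It is not the vendored `tests/_schema_validator.py`, whose contents are not shown in this section; `validate` and `FIELD_SCHEMA` are hypothetical names, and `FIELD_SCHEMA` is an assumed fragment rather than a quote from the published schema.

```python
from typing import Any, Dict, List

_TYPES = {
    "object": dict, "array": list, "string": str, "boolean": bool,
    "number": (int, float), "integer": int, "null": type(None),
}


def _is_type(value: Any, name: str) -> bool:
    # In Python bool is a subclass of int; JSON Schema treats them as distinct.
    if name in ("number", "integer") and isinstance(value, bool):
        return False
    return isinstance(value, _TYPES[name])


def validate(instance: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return a list of error strings; an empty list means the instance conforms."""
    errors: List[str] = []
    expected = schema.get("type")
    if expected is not None:
        names = expected if isinstance(expected, list) else [expected]
        if not any(_is_type(instance, n) for n in names):
            return ["%s: expected %s" % (path, "/".join(names))]
    if "enum" in schema and instance not in schema["enum"]:
        errors.append("%s: not in enum" % path)
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append("%s: missing required %r" % (path, key))
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], sub, "%s.%s" % (path, key)))
    if isinstance(instance, list) and "items" in schema:
        for i, item in enumerate(instance):
            errors.extend(validate(item, schema["items"], "%s[%d]" % (path, i)))
    return errors


# Assumed fragment describing the {value, confidence, source} envelope.
FIELD_SCHEMA = {
    "type": "object",
    "required": ["value", "confidence", "source"],
    "properties": {
        "confidence": {"type": "number"},
        "source": {"enum": ["deterministic", "llm", "none"]},
    },
}
```

Returning error strings with a JSON-path-like prefix, instead of raising on the first problem, lets a test report every violation in a fixture at once.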
89
+
90
+ ## Conventions
91
+
92
+ UTF-8 stdout/stderr is forced in `main()` (locale-safe on macOS CI). Color
93
+ auto-detects a TTY and honors `NO_COLOR`/`FORCE_COLOR`. stdout is reserved for
94
+ the machine payload; `--why`, warnings, and errors go to stderr. Exit codes:
95
+ `0` success, `1` low-signal document, `2` bad usage / user-actionable error.
@@ -0,0 +1,60 @@
1
+ # Changelog
2
+
3
+ All notable changes to `extract-cli` are documented here. The format follows
4
+ [Keep a Changelog](https://keepachangelog.com/en/1.1.0/); this project adheres
5
+ to [Semantic Versioning](https://semver.org/). Per the suite convention
6
+ (see [`docs/INTEROP.md`](docs/INTEROP.md)), **backward-incompatible changes to
7
+ the output schema require a major version bump**; new optional fields are minor.
8
+
9
+ ## [0.1.0] - 2026-05-21
10
+
11
+ Initial release — the open-loop front door of the contract-ops CLI suite.
12
+
13
+ ### Added
14
+ - Single-file, stdlib-only CLI `extract_cli.py` (`extract` entry point).
15
+ - `extract <path>` — parse `.md`/`.txt`/`.docx`/`.pdf` into structured JSON.
16
+ - `extract schema` / `fields` / `demo` / `completion` subcommands; hidden
17
+ `__complete` handler for shell completion.
18
+ - **Two explicit extraction tiers**: a deterministic, network-free default
19
+ (parties, dates, defined terms, clause map, governing law, best-effort
20
+ term/notice/value) and an opt-in `--llm` tier for fuzzy fields. Every field
21
+ carries a `confidence` and a `source` ∈ {deterministic, llm, none}.
22
+ - **Clause map**: ported template-vault-cli's three-tier clause-detection
23
+ cascade (H2 → bold-numbered → ALL-CAPS, Roman numerals 1–39 with longer
24
+ alternatives first) plus a built-in canonical clause-alias vocabulary that
25
+ normalizes a foreign document's clause titles onto the suite's shared names.
26
+ - Cross-CLI output contract published as JSON Schema 2020-12 at
27
+ [`docs/spec/extract-output.schema.json`](docs/spec/extract-output.schema.json)
28
+ and registered in [`docs/INTEROP.md`](docs/INTEROP.md).
29
+ - Suite UX conventions: `--json`/`--why`/`-q`/`--silent`/`--no-color`
30
+ (`NO_COLOR`/`FORCE_COLOR` honored)/`--demo`/`-V`; meaningful exit codes
31
+ (0/1/2); locale-safe UTF-8 stdout/stderr; ASCII-safe output.
32
+ - Shared LLM config lookup (`~/.config/contract-ops/llm.json` → `./config/llm.json`)
33
+ with `config/llm.json.example`.
34
+ - Test suite: per-tier unit tests, seeded property-based invariants (stdlib
35
+ `random.Random`, no hypothesis), a real-contract fixture corpus spanning all
36
+ input formats and clause tiers with `.expected.json` goldens, and a
37
+ schema-conformance test using a dependency-free JSON Schema validator.
38
+ - CI matrix (Ubuntu × macOS × Python 3.9–3.12) + typecheck + build-smoke jobs;
39
+ PyPI Trusted Publishing workflow on `v*` tags. `Makefile` and
40
+ `scripts/release.py`.
41
+
42
+ ### Decisions (documented per the autonomous-build playbook)
43
+ - **`.docx`/`.pdf` work without their extras.** The playbook says heavy parsing
44
+ lives behind `[docx]`/`[pdf]`; we honor the spirit (extras enhance fidelity)
45
+ while degrading *up*, not down: stdlib `zipfile`/XML reads `.docx` and a
46
+ stdlib `zlib`/text-operator reader reads `.pdf` out of the box. The extras
47
+ (`python-docx`, `pypdf`) are preferred when installed. Rationale: the
48
+ playbook's hard rule is "fully functional on `.md`/`.txt` with zero extras
49
+ and degrade gracefully" — a stdlib best-effort reader satisfies that *and*
50
+ the graceful-degradation rule better than refusing the format outright.
51
+ - **`[llm]` extra is empty.** LLM enrichment uses only stdlib `urllib`, so no
52
+ runtime dependency is required; the extra exists for suite parity.
53
+ - **Schema validation in tests uses a vendored mini-validator**, not
54
+ `jsonschema`, to keep the dev surface stdlib-aligned (dev extra is just
55
+ pytest/coverage/mypy/build).
56
+ - **`--no-confidence`** produces a reduced convenience projection that is
57
+ intentionally *not* governed by the output schema (the schema describes the
58
+ full default output).
59
+
60
+ [0.1.0]: https://github.com/DrBaher/extract-cli/releases/tag/v0.1.0
@@ -0,0 +1,74 @@
1
+ # Contributing to extract-cli
2
+
3
+ Thanks for helping with the contract-ops suite's front door.
4
+
5
+ ## Ground rules (suite conventions)
6
+
7
+ - **Stdlib-only core.** `extract_cli.py` must import nothing outside the
8
+ standard library. Heavy parsing goes behind the `[docx]`/`[pdf]` optional
9
+ extras and must degrade gracefully when they're absent.
10
+ - **Single file.** All CLI logic lives in `extract_cli.py`.
11
+ - **The LLM tier is opt-in.** Never call it in a default code path; it must be
12
+ fully skippable and the deterministic tier must stand alone.
13
+ - **stdout is the machine payload.** JSON only. Diagnostics, `--why`, and
14
+ errors go to stderr. Don't mix them.
15
+ - **ASCII-safe, locale-safe output.** `ensure_ascii=True` for JSON; no raw
16
+ unicode in table output.
17
+
18
+ ## Dev setup
19
+
20
+ ```bash
21
+ python -m venv .venv && . .venv/bin/activate
22
+ make install # editable install with [dev] extra
23
+ ```
24
+
25
+ ## Before you push
26
+
27
+ ```bash
28
+ make typecheck # mypy --strict must be clean
29
+ make test # full suite must be green
30
+ make coverage # keep coverage healthy
31
+ make spec-check # docs/spec schema must match `extract schema`
32
+ make smoke # wheel installs+runs in a clean venv
33
+ ```
34
+
35
+ CI runs the matrix (Ubuntu × macOS × Python 3.9–3.12) plus a typecheck job and
36
+ a build-smoke job. All must pass.
37
+
38
+ ## Changing the output schema
39
+
40
+ The output JSON is a **cross-CLI data contract** (see
41
+ [`docs/INTEROP.md`](docs/INTEROP.md)). Semver matters:
42
+
43
+ - **New optional field** → minor bump.
44
+ - **Removing/renaming a field, narrowing a type** → major bump.
45
+
46
+ Edit `output_schema()` in `extract_cli.py` (the source of truth), then:
47
+
48
+ ```bash
49
+ make spec-update # regenerate docs/spec/extract-output.schema.json
50
+ make goldens # regenerate .expected.json (review the diff!)
51
+ ```
52
+
53
+ ## Tests
54
+
55
+ - Per-tier unit tests in `tests/test_deterministic.py`, `tests/test_clause_map.py`.
56
+ - Property-based invariants in `tests/test_property.py` use stdlib
57
+ `random.Random(seed)` — **no hypothesis**.
58
+ - The fixture corpus (`tests/fixtures/`) spans all input formats and clause
59
+ tiers, each with a `.expected.json` golden. Binary fixtures are generated by
60
+ `tests/_fixtures_build.py` (run `make fixtures`).
61
+ - `tests/test_schema_conformance.py` validates every fixture against the schema
62
+ using the vendored validator in `tests/_schema_validator.py`.
63
+
64
+ Add a fixture + golden for any new format or clause-shape you teach it to read.
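A seeded, stdlib-only property check in the style described above might look like this. It is a sketch: `normalize_title`, `random_title`, and `check_invariants` are hypothetical, and the invariants are chosen for illustration rather than copied from `tests/test_property.py`.

```python
import random
import re
import string


def normalize_title(title: str) -> str:
    """Toy function under test: lowercase, collapse non-alphanumerics to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def random_title(rng: random.Random) -> str:
    """Generate a short title-like string from a fixed alphabet."""
    alphabet = string.ascii_letters + string.digits + " .-#*()"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 40)))


def check_invariants(seed: int, runs: int = 500) -> int:
    """Run `runs` seeded cases; the same seed always replays the same inputs."""
    rng = random.Random(seed)
    for _ in range(runs):
        title = random_title(rng)
        once = normalize_title(title)
        # Idempotence: normalizing twice changes nothing.
        assert normalize_title(once) == once, (seed, title)
        # No leading/trailing or doubled spaces survive.
        assert once == once.strip() and "  " not in once, (seed, title)
        # Only lowercase letters, digits, and single spaces remain.
        assert all(c.islower() or c.isdigit() or c == " " for c in once), (seed, title)
    return runs
```

Including the seed and the offending input in each assertion message makes a failing case reproducible without any shrinking machinery.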
65
+
66
+ ## Commits
67
+
68
+ Commit identity for this repo is **`DrBaher <Drbaher@gmail.com>`**. The
69
+ `scripts/release.py` flow sets it; for manual commits run:
70
+
71
+ ```bash
72
+ git config user.name "DrBaher"
73
+ git config user.email "Drbaher@gmail.com"
74
+ ```
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 DrBaher
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,84 @@
1
+ # extract-cli -- developer tasks. Stdlib-only project; these wrap the toolchain.
2
+ PYTHON ?= python3
3
+ PIP ?= $(PYTHON) -m pip
4
+ SPEC := docs/spec/extract-output.schema.json
5
+
6
+ .PHONY: help install test test-quick coverage typecheck build smoke \
7
+ spec-check fixtures goldens release clean
8
+
9
+ help:
10
+ @echo "extract-cli make targets:"
11
+ @echo " install editable install with [dev] extra"
12
+ @echo " test run the full test suite"
13
+ @echo " test-quick run tests, stop on first failure (-x), skip slow property runs"
14
+ @echo " coverage run tests under coverage and print a report"
15
+ @echo " typecheck mypy --strict"
16
+ @echo " build build wheel + sdist into dist/"
17
+ @echo " smoke build, install the wheel in a clean venv, and run it"
18
+ @echo " spec-check assert docs/spec schema == 'extract schema' output"
19
+ @echo " fixtures regenerate binary (.docx/.pdf) test fixtures"
20
+ @echo " goldens regenerate .expected.json golden files"
21
+ @echo " release make release VERSION=X.Y.Z"
22
+ @echo " clean remove build/test artifacts"
23
+
24
+ install:
25
+ $(PIP) install -e ".[dev]"
26
+
27
+ test:
28
+ $(PYTHON) -m pytest
29
+
30
+ test-quick:
31
+ $(PYTHON) -m pytest -x -q -k "not property"
32
+
33
+ coverage:
34
+ $(PYTHON) -m coverage run --source=extract_cli -m pytest -q
35
+ $(PYTHON) -m coverage report -m
36
+
37
+ typecheck:
38
+ $(PYTHON) -m mypy --strict extract_cli.py
39
+
40
+ build: clean
41
+ $(PYTHON) -m build
42
+
43
+ smoke: build
44
+ @set -e; \
45
+ tmp=$$(mktemp -d); \
46
+ echo "smoke venv: $$tmp"; \
47
+ $(PYTHON) -m venv $$tmp/venv; \
48
+ whl=$$(ls dist/*.whl | head -1); \
49
+ $$tmp/venv/bin/python -m pip install -q --upgrade pip; \
50
+ $$tmp/venv/bin/python -m pip install -q "$$whl"; \
51
+ echo "--- extract --version ---"; $$tmp/venv/bin/extract --version; \
52
+ echo "--- extract demo (validate against spec) ---"; \
53
+ $$tmp/venv/bin/extract demo --no-color | $(PYTHON) scripts/validate_against_spec.py; \
54
+ echo "smoke OK"; \
55
+ rm -rf $$tmp
56
+
57
+ spec-check:
58
+ @$(PYTHON) extract_cli.py schema > /tmp/extract-spec-check.json
59
+ @if diff -q /tmp/extract-spec-check.json $(SPEC) >/dev/null 2>&1; then \
60
+ echo "spec-check OK: $(SPEC) matches 'extract schema'"; \
61
+ else \
62
+ echo "spec-check FAILED: $(SPEC) is stale. Run: make spec-update"; \
63
+ diff $(SPEC) /tmp/extract-spec-check.json || true; \
64
+ exit 1; \
65
+ fi
66
+
67
+ .PHONY: spec-update
68
+ spec-update:
69
+ $(PYTHON) extract_cli.py schema > $(SPEC)
70
+ @echo "wrote $(SPEC)"
71
+
72
+ fixtures:
73
+ $(PYTHON) tests/_fixtures_build.py
74
+
75
+ goldens:
76
+ $(PYTHON) tests/_make_goldens.py
77
+
78
+ release:
79
+ @test -n "$(VERSION)" || { echo "usage: make release VERSION=X.Y.Z"; exit 2; }
80
+ $(PYTHON) scripts/release.py $(VERSION)
81
+
82
+ clean:
83
+ rm -rf dist build *.egg-info .pytest_cache .mypy_cache .coverage htmlcov
84
+ find . -type d -name __pycache__ -prune -exec rm -rf {} + 2>/dev/null || true
@@ -0,0 +1,225 @@
1
+ Metadata-Version: 2.4
2
+ Name: extract-cli
3
+ Version: 0.1.0
4
+ Summary: Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.docx/.pdf) and emit structured JSON.
5
+ Project-URL: Homepage, https://cli.drbaher.com/
6
+ Project-URL: Repository, https://github.com/DrBaher/extract-cli
7
+ Project-URL: Suite interop, https://github.com/DrBaher/extract-cli/blob/main/docs/INTEROP.md
8
+ Author-email: DrBaher <Drbaher@gmail.com>
9
+ License: MIT
10
+ License-File: LICENSE
11
+ Keywords: clause,cli,contract,extraction,json,legal,nda
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Environment :: Console
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: Intended Audience :: Legal Industry
16
+ Classifier: License :: OSI Approved :: MIT License
17
+ Classifier: Operating System :: OS Independent
18
+ Classifier: Programming Language :: Python :: 3
19
+ Classifier: Programming Language :: Python :: 3.9
20
+ Classifier: Programming Language :: Python :: 3.10
21
+ Classifier: Programming Language :: Python :: 3.11
22
+ Classifier: Programming Language :: Python :: 3.12
23
+ Classifier: Topic :: Office/Business
24
+ Classifier: Topic :: Text Processing :: Markup
25
+ Classifier: Typing :: Typed
26
+ Requires-Python: >=3.9
27
+ Provides-Extra: dev
28
+ Requires-Dist: build>=1.0; extra == 'dev'
29
+ Requires-Dist: coverage>=7.0; extra == 'dev'
30
+ Requires-Dist: mypy>=1.0; extra == 'dev'
31
+ Requires-Dist: pytest>=7.0; extra == 'dev'
32
+ Provides-Extra: docx
33
+ Requires-Dist: python-docx>=0.8.11; extra == 'docx'
34
+ Provides-Extra: llm
35
+ Provides-Extra: pdf
36
+ Requires-Dist: pypdf>=3.9.0; extra == 'pdf'
37
+ Description-Content-Type: text/markdown
38
+
39
+ # extract-cli
40
+
41
+ > Part of the contract-ops CLI suite. **extract-cli** is the suite's
42
+ > *passport control* — the **open-loop front door**. The rest of the suite is a
43
+ > closed loop that only handles documents it authored from its own templates;
44
+ > `extract-cli` ingests **any** document (yours or a counterparty's foreign
45
+ > paper) and emits a structured representation the pipeline can consume:
46
+ > [**template-vault-cli**](https://github.com/DrBaher/template-vault-CLI) (storage) feeds
47
+ > [**draft-cli**](https://github.com/DrBaher/draft-cli) (fill placeholders) →
48
+ > [**nda-review-cli**](https://github.com/DrBaher/nda-review-cli) (review, redline, negotiate) →
49
+ > [**docx2pdf-cli**](https://github.com/DrBaher/docx2pdf-cli) (DOCX → PDF) →
50
+ > [**sign-cli**](https://github.com/DrBaher/sign-cli) (signing + audit).
51
+ > Cross-version drift detection via [**compare-cli**](https://github.com/DrBaher/compare-cli).
52
+ > [Showcase site](https://cli.drbaher.com/).
53
+ >
54
+ > `extract-cli` sits **upstream of review**: it turns foreign paper into the
55
+ > suite's canonical, structured vocabulary. Its output is a **cross-CLI data
56
+ > contract** — see [`docs/INTEROP.md`](docs/INTEROP.md) and
57
+ > [`docs/spec/extract-output.schema.json`](docs/spec/extract-output.schema.json).
58
+
59
+ ```
60
+ ingest (extract) → review → diff → convert → sign
61
+ ^you are here
62
+ ```
63
+
64
+ ## What it does
65
+
66
+ Give it a contract in **`.md` / `.txt`** (native), **`.docx`**, or **`.pdf`**,
67
+ and it returns structured JSON: the parties, dates, term, governing law, a
68
+ **clause map** normalized onto the suite's canonical clause vocabulary, a
69
+ defined-term inventory, and a headline value. Every field carries a
70
+ `confidence` and a `source` so downstream tools **verify, don't trust**.
71
+
72
+ It is **stdlib-only**, single-file, terminal-first, and composable. No DB, no
73
+ daemon, no network in the default path.
74
+
75
+ ## Install
76
+
77
+ ```bash
78
+ pip install extract-cli # core: .md/.txt + best-effort .docx/.pdf
79
+ pip install "extract-cli[docx]" # higher-fidelity .docx (python-docx)
80
+ pip install "extract-cli[pdf]" # higher-fidelity .pdf (pypdf)
81
+ pip install "extract-cli[docx,pdf]" # both
82
+ ```
83
+
84
+ The core has **zero runtime dependencies** and is fully functional on `.md`/`.txt`
85
+ with no extras. `.docx` and `.pdf` work out of the box via stdlib readers; the
86
+ `[docx]`/`[pdf]` extras improve fidelity on complex documents (see
87
+ [ARCHITECTURE.md](ARCHITECTURE.md)).
88
+
89
+ ## The two extraction tiers
90
+
91
+ `extract-cli` is explicit about *how* it knows each field — encoded in every
92
+ field's `source` and in `_meta.tiers_used`.
93
+
94
+ | Tier | When | Fields | Network? |
95
+ |---|---|---|---|
96
+ | **deterministic** | always on (default) | parties, dates, defined terms, **clause map**, governing law, best-effort term/notice/value | none |
97
+ | **llm** | opt-in via `--llm` only | renewal mechanics, obligation phrasing, ambiguous governing law | yes (your provider) |
98
+
99
+ The deterministic core is **fully useful without the LLM**. The LLM tier is
100
+ opt-in, never in a hot path, and gated behind an explicit flag and a config
101
+ file — if no config is present, `--llm` degrades gracefully with a warning and
102
+ you still get the full deterministic output.
103
+
104
+ ## Commands
105
+
106
+ ```bash
107
+ extract <path> # parse a document → structured JSON on stdout (default)
108
+ extract schema # print the output JSON Schema (the cross-CLI contract)
109
+ extract fields # list extractable fields and their tier
110
+ extract demo # run on a bundled fixture and show the narrative
111
+ extract completion bash # emit a shell-completion script (bash|zsh)
112
+ ```
113
+
114
+ ### Flags
115
+
116
+ | Flag | Meaning |
117
+ |---|---|
118
+ | `--llm` | Opt-in LLM enrichment of fuzzy fields (off by default) |
119
+ | `--fields a,b,c` | Emit only a subset of top-level fields (e.g. `parties,clauses`) |
120
+ | `--format json\|table` | Output format (default `json`) |
121
+ | `--no-confidence` | Omit confidence/source markers (reduced convenience view) |
122
+ | `--json` | Force JSON to stdout (the default) |
123
+ | `--why` | Rationale block on **stderr** |
124
+ | `-q`, `--silent`, `--quiet` | Suppress non-error diagnostics |
125
+ | `--no-color` | Disable ANSI color (also honors `NO_COLOR` / `FORCE_COLOR`) |
126
+ | `-V`, `--version` | Print `extract-cli X.Y.Z` |
127
+
128
+ Streams follow the suite convention: **stdout** is the machine payload (JSON),
129
+ **stderr** is for humans (`--why`, warnings, errors). Exit codes: `0` success,
130
+ `1` low-signal document (e.g. a scanned/empty PDF), `2` bad usage or another user-actionable error.
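A Python caller could honor the stream and exit-code conventions above along these lines. This is a sketch: `interpret` and `run_extract` are hypothetical helpers, and it assumes (without confirmation from this section) that a low-signal run may still print a JSON payload on stdout.

```python
import json
import subprocess
from typing import Any, Optional, Tuple


def interpret(returncode: int, stdout: str) -> Tuple[str, Optional[Any]]:
    """Map an exit code plus captured stdout to (status, payload)."""
    if returncode not in (0, 1):
        # 2 is the documented error code; anything else is treated the same way here.
        return "error", None
    status = "ok" if returncode == 0 else "low-signal"
    if not stdout.strip():
        return status, None
    try:
        return status, json.loads(stdout)
    except ValueError:
        return "error", None


def run_extract(path: str) -> Tuple[str, Optional[Any]]:
    """Run the `extract` CLI; stderr is captured but never parsed."""
    proc = subprocess.run(["extract", path], capture_output=True, text=True)
    return interpret(proc.returncode, proc.stdout)
```

Keeping the parsing in `interpret` and the process call in `run_extract` lets the decision logic be unit-tested without the CLI installed.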
131
+
132
+ ## Output shape (abridged)
133
+
134
+ ```jsonc
135
+ {
136
+ "document": { "title": "...", "format": "markdown", "sha256": "…", "source_path": "nda.md" },
137
+ "parties": [ { "name": "Acme Robotics, Inc.", "role": "Disclosing Party", "confidence": 0.9, "source": "deterministic" } ],
138
+ "dates": { "effective": { "value": "2024-03-01", "confidence": 0.85, "source": "deterministic" }, "expiration": { "value": null, "confidence": 0.0, "source": "none" } },
139
+ "term": { "length": { "value": "3 years", ... }, "auto_renew": { "value": true, ... }, "notice_period_days": { "value": 60, ... } },
140
+ "governing_law": { "value": "State of Delaware", "confidence": 0.85, "source": "deterministic" },
141
+ "clauses": [ { "canonical_title": "Confidentiality", "detected_title": "## Confidentiality Obligations", "tier": "h2", "span": {"start": 0, "end": 120}, "confidence": 0.95, "source": "deterministic", "mapped": true } ],
142
+ "defined_terms": [ { "term": "Confidential Information", "confidence": 0.6, "source": "deterministic" } ],
143
+ "value": { "value": "$50,000", "confidence": 0.6, "source": "deterministic" },
144
+ "_meta": { "extractor_version": "0.1.0", "tiers_used": ["deterministic"], "llm_used": false }
145
+ }
146
+ ```
147
+
148
+ ## The clause map (the differentiator)
149
+
150
+ A counterparty's "SECTION 7. NON-DISCLOSURE" and your template's
151
+ "## Confidentiality" are the same clause. `extract-cli` reuses
152
+ template-vault-cli's **clause-detection cascade** (Tier 1 `## H2` headings →
153
+ Tier 2 bold-numbered `**1. …**` → Tier 3 ALL-CAPS lines) and a built-in
154
+ **canonical alias vocabulary** to normalize foreign clause titles onto the
155
+ names the rest of the suite already speaks. Clauses it can't map are kept with
156
+ `mapped: false` (and a `*` in the table view) so nothing is silently dropped.
157
+
158
+ ```bash
159
+ extract counterparty.pdf | jq '.clauses[] | {canonical_title, detected_title, mapped}'
160
+ ```
161
+
162
+ ## Composability — piping into the rest of the suite
163
+
164
+ `extract-cli` is built to be the first stage of a Unix pipe. Its JSON is the
165
+ contract every downstream tool reads.
166
+
167
+ ```bash
168
+ # 1) Foreign NDA → review. extract normalizes clauses; nda-review runs policy.
169
+ extract counterparty_nda.pdf | nda-review review --from-extract -
170
+
171
+ # 2) Pull just the clause map and feed compare-cli to diff a foreign doc
172
+ # against your canonical template's structure.
173
+ extract their_msa.docx --fields clauses | compare-cli align --stdin \
174
+ --against msa/standard
175
+
176
+ # 3) Archive structured metadata for any inbound paper into the post-signature
177
+ # vault, keyed by content hash.
178
+ extract signed_contract.pdf | contract-vault put --from-extract - \
179
+ --id "$(extract signed_contract.pdf | jq -r .document.sha256)"
180
+
181
+ # 4) Triage a folder of inbound contracts: list governing law + parties.
182
+ for f in inbox/*.pdf; do
183
+ extract "$f" --fields parties,governing_law --no-confidence \
184
+ | jq -c --arg f "$f" '{file: $f, gov: .governing_law, parties: [.parties[].name]}'
185
+ done
186
+
187
+ # 5) Gate a workflow on extraction confidence.
188
+ extract draft.docx | jq -e '.clauses | all(.confidence > 0.7)' && echo "ok to review"
189
+ ```
190
+
191
+ > The `--from-extract`/`--stdin` flags above are the consumption points the
192
+ > sibling CLIs expose (or are adopting) for this contract; see
193
+ > [`docs/INTEROP.md`](docs/INTEROP.md) for the shared conventions and the
194
+ > versioning commitment on the schema.
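For consumers written in Python rather than shell, the confidence gate from example 5 can be expressed as a small function. This is a sketch against the abridged output shape shown earlier; `ok_to_review` and `unmapped_titles` are hypothetical helper names, and the 0.7 threshold simply mirrors the `jq` example.

```python
from typing import Any, Dict, List


def ok_to_review(payload: Dict[str, Any], threshold: float = 0.7) -> bool:
    """True when every clause clears the threshold (vacuously true for no clauses)."""
    clauses = payload.get("clauses", [])
    return all(c.get("confidence", 0.0) > threshold for c in clauses)


def unmapped_titles(payload: Dict[str, Any]) -> List[str]:
    """Detected titles the extractor kept but could not map to a canonical name."""
    return [
        c.get("detected_title", "")
        for c in payload.get("clauses", [])
        if not c.get("mapped", False)
    ]


sample = {
    "clauses": [
        {"canonical_title": "Confidentiality", "detected_title": "## Confidentiality Obligations",
         "confidence": 0.95, "mapped": True},
        {"canonical_title": "Miscellany", "detected_title": "## Miscellany",
         "confidence": 0.5, "mapped": False},
    ]
}
gate = ok_to_review(sample)
leftovers = unmapped_titles(sample)
```

Like `jq`'s `all`, the gate passes on an empty clause list, so a caller that wants to reject documents with no detected clauses needs a separate check.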
195
+
196
+ ## LLM configuration (opt-in)
197
+
198
+ `--llm` reads a shared suite config, in this order:
199
+
200
+ 1. `~/.config/contract-ops/llm.json` (suite-wide — preferred)
201
+ 2. `./config/llm.json` (repo-local override)
202
+
203
+ Copy [`config/llm.json.example`](config/llm.json.example) to one of those
204
+ paths. Configure it once and every suite tool that adopts the same lookup gets
205
+ LLM features for free. Without it, `--llm` just warns and returns the
206
+ deterministic output.
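The lookup order above can be sketched as follows. The function names and the `origin` key in the demo files are hypothetical, and skipping an unreadable or unparseable file in favor of the next candidate is an assumption; the real `load_llm_config()` and the keys in `config/llm.json.example` are not shown in this section.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


def config_candidates(home: Optional[Path] = None, cwd: Optional[Path] = None) -> List[Path]:
    """Candidate paths in the documented order: suite-wide first, then repo-local."""
    home = Path(home) if home else Path.home()
    cwd = Path(cwd) if cwd else Path.cwd()
    return [home / ".config" / "contract-ops" / "llm.json", cwd / "config" / "llm.json"]


def load_config(home: Optional[Path] = None, cwd: Optional[Path] = None) -> Optional[Dict[str, Any]]:
    """Return the first candidate that parses to a JSON object, else None."""
    for candidate in config_candidates(home, cwd):
        try:
            with candidate.open(encoding="utf-8") as fh:
                data = json.load(fh)
        except (OSError, ValueError):
            continue  # missing, unreadable, or invalid JSON: try the next path
        if isinstance(data, dict):
            return data
    return None


# Demo in a throwaway directory so no real config is touched.
with tempfile.TemporaryDirectory() as tmp:
    home, cwd = Path(tmp) / "home", Path(tmp) / "repo"
    nothing = load_config(home, cwd)
    (cwd / "config").mkdir(parents=True)
    (cwd / "config" / "llm.json").write_text('{"origin": "repo-local"}', encoding="utf-8")
    only_local = load_config(home, cwd)
    suite_dir = home / ".config" / "contract-ops"
    suite_dir.mkdir(parents=True)
    (suite_dir / "llm.json").write_text('{"origin": "suite-wide"}', encoding="utf-8")
    both = load_config(home, cwd)
```

Returning `None` instead of raising matches the documented behavior that a missing config only produces a warning.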
207
+
208
+ ## Development
209
+
210
+ ```bash
211
+ make install # editable install with the [dev] extra
212
+ make test # full suite
213
+ make coverage # suite + coverage report
214
+ make typecheck # mypy --strict
215
+ make build # wheel + sdist
216
+ make smoke # build, install the wheel in a clean venv, run it
217
+ make spec-check # assert docs/spec schema == `extract schema`
218
+ make release VERSION=X.Y.Z
219
+ ```
220
+
221
+ See [ARCHITECTURE.md](ARCHITECTURE.md) and [CONTRIBUTING.md](CONTRIBUTING.md).
222
+
223
+ ## License
224
+
225
+ MIT — see [LICENSE](LICENSE).