extract-cli 0.1.0__py3-none-any.whl

@@ -0,0 +1,225 @@
+ Metadata-Version: 2.4
+ Name: extract-cli
+ Version: 0.1.0
+ Summary: Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.docx/.pdf) and emit structured JSON.
+ Project-URL: Homepage, https://cli.drbaher.com/
+ Project-URL: Repository, https://github.com/DrBaher/extract-cli
+ Project-URL: Suite interop, https://github.com/DrBaher/extract-cli/blob/main/docs/INTEROP.md
+ Author-email: DrBaher <Drbaher@gmail.com>
+ License: MIT
+ License-File: LICENSE
+ Keywords: clause,cli,contract,extraction,json,legal,nda
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Environment :: Console
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Legal Industry
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Office/Business
+ Classifier: Topic :: Text Processing :: Markup
+ Classifier: Typing :: Typed
+ Requires-Python: >=3.9
+ Provides-Extra: dev
+ Requires-Dist: build>=1.0; extra == 'dev'
+ Requires-Dist: coverage>=7.0; extra == 'dev'
+ Requires-Dist: mypy>=1.0; extra == 'dev'
+ Requires-Dist: pytest>=7.0; extra == 'dev'
+ Provides-Extra: docx
+ Requires-Dist: python-docx>=0.8.11; extra == 'docx'
+ Provides-Extra: llm
+ Provides-Extra: pdf
+ Requires-Dist: pypdf>=3.9.0; extra == 'pdf'
+ Description-Content-Type: text/markdown
+
+ # extract-cli
+
+ > Part of the contract-ops CLI suite. **extract-cli** is the suite's
+ > *passport control* — the **open-loop front door**. The rest of the suite is a
+ > closed loop that only handles documents it authored from its own templates;
+ > `extract-cli` ingests **any** document (yours or a counterparty's foreign
+ > paper) and emits a structured representation the pipeline can consume:
+ > [**template-vault-cli**](https://github.com/DrBaher/template-vault-CLI) (storage) feeds
+ > [**draft-cli**](https://github.com/DrBaher/draft-cli) (fill placeholders) →
+ > [**nda-review-cli**](https://github.com/DrBaher/nda-review-cli) (review, redline, negotiate) →
+ > [**docx2pdf-cli**](https://github.com/DrBaher/docx2pdf-cli) (DOCX → PDF) →
+ > [**sign-cli**](https://github.com/DrBaher/sign-cli) (signing + audit).
+ > Cross-version drift detection via [**compare-cli**](https://github.com/DrBaher/compare-cli).
+ > [Showcase site](https://cli.drbaher.com/).
+ >
+ > `extract-cli` sits **upstream of review**: it turns foreign paper into the
+ > suite's canonical, structured vocabulary. Its output is a **cross-CLI data
+ > contract** — see [`docs/INTEROP.md`](docs/INTEROP.md) and
+ > [`docs/spec/extract-output.schema.json`](docs/spec/extract-output.schema.json).
+
+ ```
+ ingest (extract) → review → diff → convert → sign
+ ^you are here
+ ```
+
+ ## What it does
+
+ Give it a contract in **`.md` / `.txt`** (native), **`.docx`**, or **`.pdf`**,
+ and it returns structured JSON: the parties, dates, term, governing law, a
+ **clause map** normalized onto the suite's canonical clause vocabulary, a
+ defined-term inventory, and a headline value. Every field carries a
+ `confidence` and a `source` so downstream tools **verify, don't trust**.
+
+ It is **stdlib-only**, single-file, terminal-first, and composable. No DB, no
+ daemon, no network in the default path.
+
+ ## Install
+
+ ```bash
+ pip install extract-cli # core: .md/.txt + best-effort .docx/.pdf
+ pip install "extract-cli[docx]" # higher-fidelity .docx (python-docx)
+ pip install "extract-cli[pdf]" # higher-fidelity .pdf (pypdf)
+ pip install "extract-cli[docx,pdf]" # both
+ ```
+
+ The core has **zero runtime dependencies** and is fully functional on `.md`/`.txt`
+ with no extras. `.docx` and `.pdf` work out of the box via stdlib readers; the
+ `[docx]`/`[pdf]` extras improve fidelity on complex documents (see
+ [ARCHITECTURE.md](ARCHITECTURE.md)).
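
The optional extras correspond to importable modules (python-docx installs as `docx`, pypdf as `pypdf`). The following is a minimal sketch of how a caller might check which reader a given file type would get. It says nothing about how `extract_cli` performs this check internally; `reader_tier` and its return labels are hypothetical names used only for illustration.

```python
from importlib.util import find_spec

# Import names of the optional libraries behind the [docx] and [pdf] extras.
OPTIONAL_READERS = {".docx": "docx", ".pdf": "pypdf"}


def reader_tier(suffix: str) -> str:
    """Return a (hypothetical) label for the reader a file suffix would use."""
    module = OPTIONAL_READERS.get(suffix.lower())
    if module is None:
        # No optional library is associated with this suffix (.md, .txt, ...).
        return "native"
    if find_spec(module) is not None:
        # The higher-fidelity library is importable in this environment.
        return "extra"
    # Library missing: the core's best-effort stdlib reader would apply.
    return "stdlib-fallback"
```

Calling `reader_tier(".pdf")` before a batch run is one way to decide whether installing `extract-cli[pdf]` first is worthwhile.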
+
+ ## The two extraction tiers
+
+ `extract-cli` is explicit about *how* it knows each field — encoded in every
+ field's `source` and in `_meta.tiers_used`.
+
+ | Tier | When | Fields | Network? |
+ |---|---|---|---|
+ | **deterministic** | always on (default) | parties, dates, defined terms, **clause map**, governing law, best-effort term/notice/value | none |
+ | **llm** | opt-in via `--llm` only | renewal mechanics, obligation phrasing, ambiguous governing law | yes (your provider) |
+
+ The deterministic core is **fully useful without the LLM**. The LLM tier is
+ opt-in, never in a hot path, and gated behind an explicit flag and a config
+ file — if no config is present, `--llm` degrades gracefully with a warning and
+ you still get the full deterministic output.
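
Because every field records its `source`, a consumer can count how much of a result came from each tier. The sketch below walks a parsed output dict and tallies `source` values. It assumes only the field layout shown in this README; `iter_sources` and `tally_sources` are hypothetical helper names.

```python
from collections import Counter
from typing import Any, Iterator


def iter_sources(node: Any) -> Iterator[str]:
    """Yield every string-valued `source` key found anywhere in the output."""
    if isinstance(node, dict):
        source = node.get("source")
        if isinstance(source, str):
            yield source
        for value in node.values():
            yield from iter_sources(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_sources(item)


def tally_sources(output: dict) -> Counter:
    """Count fields per source, e.g. 'deterministic', 'llm', or 'none'."""
    return Counter(iter_sources(output))
```

A tally with any non-deterministic source while `_meta.llm_used` is false would be worth investigating, since the README describes the LLM tier as strictly opt-in.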
+
+ ## Commands
+
+ ```bash
+ extract <path> # parse a document → structured JSON on stdout (default)
+ extract schema # print the output JSON Schema (the cross-CLI contract)
+ extract fields # list extractable fields and their tier
+ extract demo # run on a bundled fixture and show the narrative
+ extract completion bash # emit a shell-completion script (bash|zsh)
+ ```
+
+ ### Flags
+
+ | Flag | Meaning |
+ |---|---|
+ | `--llm` | Opt-in LLM enrichment of fuzzy fields (off by default) |
+ | `--fields a,b,c` | Emit only a subset of top-level fields (e.g. `parties,clauses`) |
+ | `--format json\|table` | Output format (default `json`) |
+ | `--no-confidence` | Omit confidence/source markers (reduced convenience view) |
+ | `--json` | Force JSON to stdout (the default) |
+ | `--why` | Rationale block on **stderr** |
+ | `-q`, `--silent`, `--quiet` | Suppress non-error diagnostics |
+ | `--no-color` | Disable ANSI color (also honors `NO_COLOR` / `FORCE_COLOR`) |
+ | `-V`, `--version` | Print `extract-cli X.Y.Z` |
+
+ Streams follow the suite convention: **stdout** is the machine payload (JSON),
+ **stderr** is for humans (`--why`, warnings, errors). Exit codes: `0` success,
+ `1` low-signal document (e.g. a scanned/empty PDF), `2` bad usage.
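
A wrapper script can branch on the documented exit codes before touching stdout. The sketch below is one way to do that. `interpret` is a hypothetical helper, and parsing stdout only on exit code `0` is an assumption, since the README does not say what stdout contains for a low-signal document.

```python
import json
from typing import Optional, Tuple

# Exit codes as documented in the README.
EXIT_MEANING = {0: "success", 1: "low-signal document", 2: "bad usage"}


def interpret(returncode: int, stdout: str) -> Tuple[str, Optional[dict]]:
    """Map a finished `extract` run to (status, payload)."""
    status = EXIT_MEANING.get(returncode, f"unexpected exit code {returncode}")
    payload = None
    if returncode == 0:
        # Assumption: only a successful run is guaranteed to carry JSON.
        payload = json.loads(stdout)
    return status, payload


# Typical use (requires the `extract` command on PATH):
#   proc = subprocess.run(["extract", "nda.md"], capture_output=True, text=True)
#   status, payload = interpret(proc.returncode, proc.stdout)
```

Keeping the interpretation separate from the `subprocess` call makes the branching logic testable without the CLI installed.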
+
+ ## Output shape (abridged)
+
+ ```jsonc
+ {
+ "document": { "title": "...", "format": "markdown", "sha256": "…", "source_path": "nda.md" },
+ "parties": [ { "name": "Acme Robotics, Inc.", "role": "Disclosing Party", "confidence": 0.9, "source": "deterministic" } ],
+ "dates": { "effective": { "value": "2024-03-01", "confidence": 0.85, "source": "deterministic" }, "expiration": { "value": null, "confidence": 0.0, "source": "none" } },
+ "term": { "length": { "value": "3 years", ... }, "auto_renew": { "value": true, ... }, "notice_period_days": { "value": 60, ... } },
+ "governing_law": { "value": "State of Delaware", "confidence": 0.85, "source": "deterministic" },
+ "clauses": [ { "canonical_title": "Confidentiality", "detected_title": "## Confidentiality Obligations", "tier": "h2", "span": {"start": 0, "end": 120}, "confidence": 0.95, "source": "deterministic", "mapped": true } ],
+ "defined_terms": [ { "term": "Confidential Information", "confidence": 0.6, "source": "deterministic" } ],
+ "value": { "value": "$50,000", "confidence": 0.6, "source": "deterministic" },
+ "_meta": { "extractor_version": "0.1.0", "tiers_used": ["deterministic"], "llm_used": false }
+ }
+ ```
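
To act on "verify, don't trust", a consumer can list every field whose `confidence` falls below a chosen threshold. The sketch below does this for a parsed output dict, relying only on the shape shown above. `low_confidence` and the path notation it produces are hypothetical.

```python
from typing import Any, Iterator, Tuple


def low_confidence(
    node: Any, threshold: float, path: str = ""
) -> Iterator[Tuple[str, float]]:
    """Yield (path, confidence) for each field scoring below `threshold`."""
    if isinstance(node, dict):
        conf = node.get("confidence")
        if isinstance(conf, (int, float)) and conf < threshold:
            yield path or ".", float(conf)
        for key, value in node.items():
            yield from low_confidence(value, threshold, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from low_confidence(item, threshold, f"{path}[{index}]")
```

On the abridged example above with a threshold of `0.7`, this would flag the missing expiration date, the defined term, and the headline value, which are the fields a reviewer would most likely want to check by hand.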
+
+ ## The clause map (the differentiator)
+
+ A counterparty's "SECTION 7. NON-DISCLOSURE" and your template's
+ "## Confidentiality" are the same clause. `extract-cli` reuses
+ template-vault-cli's **clause-detection cascade** (Tier 1 `## H2` headings →
+ Tier 2 bold-numbered `**1. …**` → Tier 3 ALL-CAPS lines) and a built-in
+ **canonical alias vocabulary** to normalize foreign clause titles onto the
+ names the rest of the suite already speaks. Clauses it can't map are kept with
+ `mapped: false` (and a `*` in the table view) so nothing is silently dropped.
+
+ ```bash
+ extract counterparty.pdf | jq '.clauses[] | {canonical_title, detected_title, mapped}'
+ ```
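
The cascade and alias lookup described above can be illustrated with a small sketch. This is not the code in `extract_cli.py`: the regular expressions, the alias table, the tier labels other than `h2`, and the choice of `None` as the canonical title of an unmapped clause are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical patterns for the three tiers, tried in cascade order.
TIER_PATTERNS = [
    ("h2", re.compile(r"^##\s+(?P<title>.+?)\s*$")),
    ("bold_numbered", re.compile(r"^\*\*\d+\.\s*(?P<title>.+?)\*\*\s*$")),
    ("all_caps", re.compile(r"^(?P<title>[A-Z0-9][A-Z0-9 .,&'\-]{3,})$")),
]

# Hypothetical alias table: normalized foreign title -> canonical title.
ALIASES = {
    "confidentiality": "Confidentiality",
    "confidentiality obligations": "Confidentiality",
    "non-disclosure": "Confidentiality",
    "governing law": "Governing Law",
}


def normalize(title: str) -> str:
    """Drop a leading 'SECTION 7.' or '7.' prefix and lower-case the rest."""
    stripped = re.sub(
        r"^(section\s+)?\d+[.)]?\s*", "", title.strip(), flags=re.IGNORECASE
    )
    return stripped.strip(" .:").lower()


def detect_clause(line: str) -> Optional[dict]:
    """Return a clause record for a heading line, or None for body text."""
    text = line.strip()
    for tier, pattern in TIER_PATTERNS:
        match = pattern.match(text)
        if match:
            canonical = ALIASES.get(normalize(match.group("title")))
            return {
                "tier": tier,
                "detected_title": text,
                "canonical_title": canonical,  # None when no alias matches
                "mapped": canonical is not None,
            }
    return None
```

The first tier that matches wins, so a line is never classified twice, and an unrecognized heading still produces a record with `mapped` set to false rather than being dropped.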
+
+ ## Composability — piping into the rest of the suite
+
+ `extract-cli` is built to be the first stage of a Unix pipe. Its JSON is the
+ contract every downstream tool reads.
+
+ ```bash
+ # 1) Foreign NDA → review. extract normalizes clauses; nda-review runs policy.
+ extract counterparty_nda.pdf | nda-review review --from-extract -
+
+ # 2) Pull just the clause map and feed compare-cli to diff a foreign doc
+ # against your canonical template's structure.
+ extract their_msa.docx --fields clauses | compare-cli align --stdin \
+     --against msa/standard
+
+ # 3) Archive structured metadata for any inbound paper into the post-signature
+ # vault, keyed by content hash.
+ extract signed_contract.pdf | contract-vault put --from-extract - \
+     --id "$(extract signed_contract.pdf | jq -r .document.sha256)"
+
+ # 4) Triage a folder of inbound contracts: list governing law + parties.
+ # jq reads stdin here, so `input_filename` would be null; pass the name in.
+ for f in inbox/*.pdf; do
+   extract "$f" --fields parties,governing_law --no-confidence \
+     | jq -c --arg file "$f" '{file: $file, gov: .governing_law, parties: [.parties[].name]}'
+ done
+
+ # 5) Gate a workflow on extraction confidence.
+ extract draft.docx | jq -e '.clauses | all(.confidence > 0.7)' && echo "ok to review"
+ ```
+
+ > The `--from-extract`/`--stdin` flags above are the consumption points the
+ > sibling CLIs expose (or are adopting) for this contract; see
+ > [`docs/INTEROP.md`](docs/INTEROP.md) for the shared conventions and the
+ > versioning commitment on the schema.
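
The confidence gate in example 5 has one subtlety worth spelling out: jq's `all` is true for an empty array, so a document with no detected clauses would pass. The sketch below shows a stricter gate in Python. `ok_to_review` is a hypothetical helper, and treating an empty clause list as a failure is a design choice of this example, not documented suite behavior.

```python
from typing import Any, Dict


def ok_to_review(output: Dict[str, Any], threshold: float = 0.7) -> bool:
    """True only if at least one clause exists and all exceed `threshold`."""
    clauses = output.get("clauses") or []
    if not clauses:
        # Unlike `all()` on an empty list, refuse to pass a document
        # in which no clauses were detected at all.
        return False
    return all(clause.get("confidence", 0.0) > threshold for clause in clauses)
```

A clause without a `confidence` key (for example, output produced with `--no-confidence`) is treated as scoring `0.0`, so the gate fails closed.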
+
+ ## LLM configuration (opt-in)
+
+ `--llm` reads a shared suite config, in this order:
+
+ 1. `~/.config/contract-ops/llm.json` (suite-wide — preferred)
+ 2. `./config/llm.json` (repo-local override)
+
+ Copy [`config/llm.json.example`](config/llm.json.example) to one of those
+ paths. Configure it once and every suite tool that adopts the same lookup gets
+ LLM features for free. Without it, `--llm` just warns and returns the
+ deterministic output.
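
A sibling tool adopting the same lookup might resolve the config roughly as follows. This sketch assumes "in this order" means the first existing file wins; the README also calls the repo-local file an "override", so the real precedence should be checked against `docs/INTEROP.md`. `find_llm_config` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def find_llm_config(
    home: Optional[Path] = None, cwd: Optional[Path] = None
) -> Optional[Path]:
    """Return the first existing config path in the documented order."""
    home = home or Path.home()
    cwd = cwd or Path.cwd()
    candidates = [
        home / ".config" / "contract-ops" / "llm.json",  # suite-wide
        cwd / "config" / "llm.json",                      # repo-local
    ]
    for path in candidates:
        if path.is_file():
            return path
    # No config found: the caller should warn and stay deterministic.
    return None
```

Taking `home` and `cwd` as parameters keeps the lookup testable against temporary directories instead of the real home directory.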
+
+ ## Development
+
+ ```bash
+ make install # editable install with the [dev] extra
+ make test # full suite
+ make coverage # suite + coverage report
+ make typecheck # mypy --strict
+ make build # wheel + sdist
+ make smoke # build, install the wheel in a clean venv, run it
+ make spec-check # assert docs/spec schema == `extract schema`
+ make release VERSION=X.Y.Z
+ ```
+
+ See [ARCHITECTURE.md](ARCHITECTURE.md) and [CONTRIBUTING.md](CONTRIBUTING.md).
+
+ ## License
+
+ MIT — see [LICENSE](LICENSE).
@@ -0,0 +1,6 @@
+ extract_cli.py,sha256=llCZ-ekan8dyYvAPUY2FB19h6e18shyw7QMJlwKj11A,64966
+ extract_cli-0.1.0.dist-info/METADATA,sha256=nn5cuR9v2vW6Q9RuX_vqMnYd72POdK3sIsiEn0XGoc4,10280
+ extract_cli-0.1.0.dist-info/WHEEL,sha256=QccIxa26bgl1E6uMy58deGWi-0aeIkkangHcxk2kWfw,87
+ extract_cli-0.1.0.dist-info/entry_points.txt,sha256=0MOwRHJrGZ0ulUdamli4UJSi9nh6d_wsmPkBYgaBJr0,45
+ extract_cli-0.1.0.dist-info/licenses/LICENSE,sha256=7IYz6lY4eoRi4rHZvbpwPUwc8dEpQ2zO7YStwGs9aPs,1064
+ extract_cli-0.1.0.dist-info/RECORD,,
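
Each RECORD row is a CSV triple of path, hash, and size in bytes, where the hash is the URL-safe base64 encoding of the SHA-256 digest with trailing `=` padding removed. The sketch below shows how a reviewer could recompute that value for a file's bytes and compare it against a row. `record_hash` and `parse_record_line` are hypothetical helper names.

```python
import base64
import csv
import hashlib
from typing import Optional, Tuple


def record_hash(data: bytes) -> str:
    """Compute a RECORD-style hash field for raw file contents."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def parse_record_line(line: str) -> Tuple[str, Optional[str], Optional[int]]:
    """Split one RECORD row; hash and size are empty for RECORD itself."""
    path, hash_field, size = next(csv.reader([line]))
    return path, hash_field or None, int(size) if size else None
```

For an unpacked wheel, comparing `record_hash(path.read_bytes())` with the parsed hash field, and the file length with the parsed size, confirms that a file matches what the wheel declares.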
@@ -0,0 +1,4 @@
+ Wheel-Version: 1.0
+ Generator: hatchling 1.29.0
+ Root-Is-Purelib: true
+ Tag: py3-none-any
@@ -0,0 +1,2 @@
+ [console_scripts]
+ extract = extract_cli:main
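
The `console_scripts` entry means the installed `extract` command imports the module `extract_cli` and calls its `main` attribute. The sketch below shows how such a `name = module:attr` line resolves to a callable. `parse_entry_point` and `load` are hypothetical helpers; installers generate their own launcher scripts rather than using this code.

```python
import importlib
from typing import Any, Tuple


def parse_entry_point(line: str) -> Tuple[str, str, str]:
    """Split 'name = module:attr' into (name, module, attr)."""
    name, _, target = (part.strip() for part in line.partition("="))
    module, _, attr = target.partition(":")
    return name, module.strip(), attr.strip()


def load(module: str, attr: str) -> Any:
    """Import `module` and walk the dotted `attr` path to the target object."""
    obj: Any = importlib.import_module(module)
    for part in attr.split("."):
        obj = getattr(obj, part)
    return obj
```

With the package installed, `load("extract_cli", "main")` would return the function that the `extract` command runs.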
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 DrBaher
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.