contract-archive-cli 0.2.7__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- contract_archive/__init__.py +2 -0
- contract_archive/archive/__init__.py +64 -0
- contract_archive/archive/db.py +126 -0
- contract_archive/archive/ingest.py +667 -0
- contract_archive/archive/migrations/001_init.sql +62 -0
- contract_archive/archive/migrations/002_obligations.sql +25 -0
- contract_archive/archive/migrations/003_document_types.sql +31 -0
- contract_archive/archive/migrations/004_seals_subjects.sql +36 -0
- contract_archive/archive/migrations/005_completeness.sql +18 -0
- contract_archive/archive/party_registry.py +276 -0
- contract_archive/archive/paths.py +113 -0
- contract_archive/archive/repository.py +918 -0
- contract_archive/cli.py +455 -0
- contract_archive/cli_common.py +293 -0
- contract_archive/cli_config.py +96 -0
- contract_archive/cli_introspect.py +204 -0
- contract_archive/cli_party.py +166 -0
- contract_archive/cli_query.py +492 -0
- contract_archive/cli_render.py +575 -0
- contract_archive/config.py +257 -0
- contract_archive/errors.py +163 -0
- contract_archive/extraction/__init__.py +14 -0
- contract_archive/extraction/amount_check.py +87 -0
- contract_archive/extraction/contract_extractor.py +103 -0
- contract_archive/extraction/document_extractor.py +546 -0
- contract_archive/extraction/evidence_page_fix.py +99 -0
- contract_archive/extraction/llm_extractor.py +207 -0
- contract_archive/extraction/normalize.py +210 -0
- contract_archive/extraction/property_fee.py +79 -0
- contract_archive/extraction/vision_seal.py +390 -0
- contract_archive/pipelines/__init__.py +9 -0
- contract_archive/pipelines/mineru_pipeline.py +955 -0
- contract_archive/pipelines/vl_ocr.py +160 -0
- contract_archive/schemas/__init__.py +67 -0
- contract_archive/schemas/document.py +408 -0
- contract_archive/utils/__init__.py +27 -0
- contract_archive/utils/device.py +51 -0
- contract_archive/utils/http_env.py +54 -0
- contract_archive/utils/pdf.py +207 -0
- contract_archive_cli-0.2.7.dist-info/METADATA +386 -0
- contract_archive_cli-0.2.7.dist-info/RECORD +44 -0
- contract_archive_cli-0.2.7.dist-info/WHEEL +4 -0
- contract_archive_cli-0.2.7.dist-info/entry_points.txt +2 -0
- contract_archive_cli-0.2.7.dist-info/licenses/LICENSE +21 -0
contract_archive_cli-0.2.7.dist-info/METADATA
@@ -0,0 +1,386 @@
Metadata-Version: 2.4
Name: contract-archive-cli
Version: 0.2.7
Summary: Local document archive CLI: OCR parsing + qwen3.7-max field extraction + SQLite index (contracts, certificates, invoices, etc.)
Author: crhan
License: MIT
License-File: LICENSE
Requires-Python: <3.13,>=3.10
Requires-Dist: click<9,>=8.0
Requires-Dist: dashscope>=1.22.2
Requires-Dist: openai>=1.40
Requires-Dist: pillow>=10.0
Requires-Dist: pydantic>=2.6
Requires-Dist: pymupdf>=1.24
Requires-Dist: python-dotenv>=1.0
Requires-Dist: rich>=13.7
Requires-Dist: socksio>=1.0
Requires-Dist: tenacity>=8.2
Requires-Dist: typer<0.26,>=0.12
Provides-Extra: dev
Requires-Dist: ipython>=8.0; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Provides-Extra: mineru
Requires-Dist: mineru[core]<4.0,>=3.1; extra == 'mineru'
Requires-Dist: scipy>=1.10; extra == 'mineru'
Description-Content-Type: text/markdown
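
# Local Document Archive CLI

> Batch-ingest, archive, and trace all kinds of PDF documents: contracts, agreements, certificates, invoices, reports, and more.
> MinerU parses the page layout and text, the qwen3.7-max LLM classifies the document type and extracts fields, and everything is indexed in a local SQLite database
> with filtered search by type and field.

**LLM-first**: both document-type classification and field extraction are delegated to the LLM and normalized into a single "generic envelope"
(doc_type / title / summary / parties / amounts / dates / flexible fields). Adding a new document type
**requires no code**: the LLM decides what to extract. Hard-coded rules survive only as deterministic value normalization
(Chinese uppercase amounts → numbers, dates → ISO). Contracts additionally get their own tuned prompt (still pure LLM),
which keeps all contract fields and queries (Party A/Party B, expiry date, auto-renewal, risk clauses, obligation list).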
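
The deterministic normalization mentioned above (Chinese uppercase amounts to numbers, dates to ISO) can be sketched as follows. This is a minimal illustration, not the package's `normalize.py`: the function names are hypothetical, and the parser assumes well-formed financial numerals without nested big units such as 万亿.

```python
import re
from datetime import date
from decimal import Decimal

DIGITS = {ch: i for i, ch in enumerate("零壹贰叁肆伍陆柒捌玖")}
SMALL_UNITS = {"拾": 10, "佰": 100, "仟": 1000}
BIG_UNITS = {"万": 10**4, "亿": 10**8}
FRACTION_UNITS = {"角": Decimal("0.1"), "分": Decimal("0.01")}


def parse_integer_part(text: str) -> int:
    """Parse the part before 元, e.g. 壹拾贰万叁仟 (hypothetical helper)."""
    total = 0    # sections already closed by 万 / 亿
    section = 0  # value accumulated since the last big unit
    digit = 0    # most recent digit, waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            total += (section + digit) * BIG_UNITS[ch]
            section = digit = 0
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return total + section + digit


def chinese_amount_to_decimal(text: str) -> Decimal:
    """Convert an uppercase amount such as 伍仟元整 to a Decimal."""
    text = text.strip().rstrip("整正")
    match = re.search(r"[元圆]", text)
    if match:
        integer_text, fraction_text = text[: match.start()], text[match.end():]
    else:
        integer_text, fraction_text = "", text
    amount = Decimal(parse_integer_part(integer_text)) if integer_text else Decimal(0)
    digit = 0
    for ch in fraction_text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in FRACTION_UNITS:
            amount += digit * FRACTION_UNITS[ch]
            digit = 0
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return amount


def chinese_date_to_iso(text: str) -> str:
    """Convert 2024年1月5日 to 2024-01-05; date() rejects impossible dates."""
    m = re.fullmatch(r"\s*(\d{4})\s*年\s*(\d{1,2})\s*月\s*(\d{1,2})\s*日?\s*", text)
    if not m:
        raise ValueError(f"not a recognised date: {text!r}")
    return date(*map(int, m.groups())).isoformat()
```

Keeping this step rule-based rather than asking the LLM means the same input always yields the same stored value.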

History: this project began as a playground comparing three OCR paths (DashScope / PaddleOCR / MinerU).
After MinerU was chosen it was refactored into an archive CLI, then extended from "contracts only" to a "general document archive".

## ✦ Data flow

```
PDF ─► sha256 dedup ─► MinerU parsing ─► LLM classifies type + extracts fields (contracts use a dedicated prompt, pure LLM)
                                      │
          ┌───────────────────────────┴──┐
          ▼                              ▼
db.sqlite (generic envelope + index)     documents/<sha-12>/
                                         ├── source.pdf (hard link)
                                         ├── mineru/markdown.md ...
                                         ├── extraction_result.json (generic envelope)
                                         └── ingest.log

The archive lives in the XDG data directory by default: ~/.local/share/contract-archive/
```
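
The first step of the flow, hashing the PDF and deriving its `documents/<sha-12>/` directory, might look roughly like this. The helper names are illustrative assumptions; only the sha256 hashing and the 12-character prefix come from the README.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large PDFs never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def document_dir(archive_root: Path, sha256_hex: str) -> Path:
    """Directory for one document: documents/<first 12 hex chars of the sha256>/."""
    return archive_root / "documents" / sha256_hex[:12]
```

Because the hash depends only on file content, the same PDF reached through two different paths maps to the same directory and the same index row.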

## ✦ Installation

```bash
# 1) Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2) Install dependencies (the mineru extra pulls in the MinerU package; the first ingest run also downloads >1 GB of models from modelscope)
./scripts/setup.sh mineru
```

> **uv hardlink pitfall**: with `UV_LINK_MODE=hardlink` (uv's default on Linux and Windows), a package occasionally ends up with only part of its files installed
> (observed with `cv2`/`pptx`, triggering `module 'cv2' has no attribute 'INTER_NEAREST'`
> or `cannot import name 'Presentation' from 'pptx'`). `scripts/setup.sh` already
> runs an explicit `export UV_LINK_MODE=copy` to avoid this. Set it as well when running `uv sync` by hand.
> A broken package can be repaired with `uv pip install --force-reinstall --no-deps <package>`.

If you only want to query an existing archive with list/search/show (a db.sqlite that someone else already ingested on this machine),
you can skip the mineru extra:

```bash
./scripts/setup.sh base
```

## ✦ Global install (optional)

To use `contract-archive` from any directory (without `cd`-ing into the project or going through `uv run`), use `uv tool install`:

```bash
# Note: use ".[mineru]" (the project's mineru extra = mineru[core], which includes torch and other model dependencies).
# Do not use `--with mineru`: bare mineru lacks [core], so the installed tool fails on ingest with
# ModuleNotFoundError: No module named 'torch'.
# Use --reinstall rather than --force: when the version number is unchanged, --force hits the old wheel in uv's cache
# and installs stale code (observed to stay on the old version); --reinstall forces a rebuild, so the update is reliable.
UV_LINK_MODE=copy uv tool install --reinstall "/path/to/contract-archive-cli[mineru]"
```

`uv tool install` creates a dedicated venv (isolated from the project venv) and exposes the command at `~/.local/bin/contract-archive`.
Then, from any directory:

```bash
# Point at the archive with an environment variable
CONTRACT_ARCHIVE_DIR=~/contracts contract-archive list

# Or pass --archive explicitly (a per-command option, placed after the subcommand)
contract-archive list --archive ~/contracts
contract-archive ingest ~/Documents/new_contract.pdf --archive ~/contracts
```

`DASHSCOPE_API_KEY` must be provided through the shell environment (putting it in `~/.zshrc` or a dedicated shell wrapper is recommended).

**Developers (source edits should take effect immediately)**: add `--editable` so the global command points at this repository's source instead of a snapshot.
Edits to `.py` files then take effect right away, with no reinstall:

```bash
UV_LINK_MODE=copy uv tool install --editable --reinstall "/path/to/contract-archive-cli[mineru]"
```

> Without `--editable`, what gets installed is a snapshot of the code at that moment: after later source changes or a version bump in the repository, you must
> run `--reinstall` again to update. If `contract-archive --version` disagrees with the repository's `pyproject.toml`,
> you most likely have a stale snapshot installed; reinstalling fixes it.

Uninstall:

```bash
uv tool uninstall contract-archive-cli
# Data and config are not removed with it; clean them up manually:
#   ~/.local/share/contract-archive   archive data (db.sqlite + documents/)
#   ~/.config/contract-archive        config.json
```

## ✦ Configuration

Two methods, with precedence **environment variables (including .env) > config file > defaults**:

```bash
# Method 1: the config command (writes ~/.config/contract-archive/config.json with mode 0600, safer than a project .env)
contract-archive config set dashscope.api_key sk-xxx
contract-archive config show                 # show each setting's effective value and source (secrets masked by default)
contract-archive config show --format json   # machine-readable: includes key/env/secret/default/value/source
contract-archive config unset dashscope.api_key

# Method 2: a project .env (the older way, still supported)
cp .env.example .env
$EDITOR .env   # fill in DASHSCOPE_API_KEY
```

| Environment variable | config key | Description |
| --- | --- | --- |
| `DASHSCOPE_API_KEY` | `dashscope.api_key` | Required. Apply in the [Bailian console](https://dashscope.console.aliyun.com/) |
| `DASHSCOPE_LLM_MODEL` | `dashscope.model` | Default `qwen3.7-max` (an alias specific to the user's Bailian account; on a 404, switch to `qwen-max` / `qwen3-max`) |
| `DASHSCOPE_BASE_URL` | `dashscope.base_url` | Default `https://dashscope.aliyuncs.com/api/v1`; outside mainland China use `https://dashscope-intl.aliyuncs.com/api/v1` |
| `DASHSCOPE_VL_MODEL` | `dashscope.vl_model` | Vision model for seal and signature verification, default `qwen3.6-flash` |
| `CONTRACT_ARCHIVE_DIR` | `archive.dir` | Archive root directory, default XDG `~/.local/share/contract-archive`; the CLI `--archive` option takes precedence |
| `COMPUTE_DEVICE` | - | `auto` / `mps` / `cuda` / `cpu` (MinerU runs in a subprocess; this mainly affects its internal backend selection) |
| `LOG_LEVEL` | - | `DEBUG`/`INFO`/`WARNING`/..., default `INFO`; overridden by `--verbose`/`--quiet` |
| `DASHSCOPE_TIMEOUT_S` | - | Timeout in seconds for LLM/VL calls, default `300` |
| `CONTRACT_ARCHIVE_MINERU_TIMEOUT_S` | - | Timeout in seconds for the MinerU parsing subprocess, default `1800` |
| `MINERU_MODEL_SOURCE` | - | MinerU model source, default `modelscope` (fast inside mainland China); elsewhere you can export `huggingface` |

> Entries marked `-` are runtime knobs: they stay env-only and are not part of the config-file layer.
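
The precedence rule (environment > config file > default) can be pictured with a small resolver. This is a sketch under assumptions: the `SETTINGS` table repeats three rows from the table above, while the flat dotted-key layout of `config.json` and the names `resolve` and `load_file_values` are hypothetical, not the package's `config.py`.

```python
import json
from pathlib import Path
from typing import Mapping, Optional

# config key -> (environment variable, default); values copied from the table above
SETTINGS: dict[str, tuple[str, Optional[str]]] = {
    "dashscope.api_key": ("DASHSCOPE_API_KEY", None),
    "dashscope.model": ("DASHSCOPE_LLM_MODEL", "qwen3.7-max"),
    "archive.dir": ("CONTRACT_ARCHIVE_DIR", "~/.local/share/contract-archive"),
}


def load_file_values(path: Path) -> dict[str, str]:
    """Read the config file; a missing file simply contributes nothing.

    Assumes a flat JSON object with dotted keys, which may differ from the real file.
    """
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def resolve(
    key: str, env: Mapping[str, str], file_values: Mapping[str, str]
) -> tuple[Optional[str], str]:
    """Return (value, source), where source is 'env', 'file' or 'default'."""
    env_name, default = SETTINGS[key]
    if env.get(env_name):  # an empty variable is treated as unset
        return env[env_name], "env"
    if key in file_values:
        return file_values[key], "file"
    return default, "default"
```

Returning the source alongside the value is what lets a command like `config show` report where each effective setting came from.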

## ✦ Usage

```bash
# Ingest a single PDF
uv run contract-archive ingest path/to/contract.pdf

# Batch-ingest a whole directory (recursively scans *.pdf, deduplicated by sha256)
uv run contract-archive ingest ~/Documents/contracts/

# Skip the LLM (also works without an API key): store only the MinerU output, leave extracted fields empty, and fill them in later with extract
uv run contract-archive ingest path/to/contract.pdf --no-llm

# Force a rerun (re-process files that were already ingested, overwriting the old record)
uv run contract-archive ingest path/to/contract.pdf --reingest

# Trial run on the first 3 files
uv run contract-archive ingest ~/Documents/contracts/ --limit 3

# Cost / progress (agent-friendly)
uv run contract-archive ingest ~/Documents/contracts/ --dry-run           # preview only: how many files were found and how many API calls are expected; creates no archive and costs nothing
uv run contract-archive ingest ~/Documents/contracts/ --max-files 20      # exit with an error if more than 20 files are found, to guard against feeding in a huge directory by mistake
uv run contract-archive ingest ~/Documents/contracts/ --progress ndjson   # one JSON event line per file, for agents to consume as a stream
```

### Query

```bash
# List everything (newest ingest first, 50 rows by default)
uv run contract-archive list

# Sort by signing date, only records with status partial
uv run contract-archive list --order-by sign_date --status partial

# Emit JSON for scripts
uv run contract-archive list --format json | jq '.[] | .contract_name'

# Multi-field filters (all ANDed)
uv run contract-archive search --party 张三 --amount-min 100000 --signed-after 2024-01-01
uv run contract-archive search --expire-before 2026-12-31 --has-risk
uv run contract-archive search --name 车位 --auto-renewal

# Show one record (id, or a sha prefix of at least 4 characters)
uv run contract-archive show 5
uv run contract-archive show a3f9c2b1

# View the source text: show displays the fields the LLM extracted; raw displays the OCR text the extraction was based on (the same content that was fed to the LLM)
# In an interactive terminal, matched keywords are colored by extraction source (parties/amounts/dates/risks/fields), so you can see at a glance what was recognized
uv run contract-archive raw 5
uv run contract-archive raw a3f9c2b1 | grep 违约           # when piped, output is automatically plain text, so grep is not disrupted
uv run contract-archive raw 5 --color always | less -R    # force color, paired with less -R
```

### To-do board (obligation list)

During extraction, each contract is broken down into both parties' "actions" (submitting documents, paying, delivering, signing, etc.), stored in a
separate `obligations` table; each row carries an `actor` (Party A / Party B / both) and a `deadline`:

```bash
# List all to-dos across contracts (deadline ascending, NULLs last)
contract-archive todo --include-undated

# Things due within the next 30 days
contract-archive todo --within-days 30

# Only Party A's tasks / only Party B's tasks
contract-archive todo --actor party_a
contract-archive todo --actor party_b --before 2026-12-31

# Find "contracts with an action due within the next 30 days" (returns contract rows, not individual obligations)
contract-archive search --deadline-before 2026-06-30 --actor party_b
```

`contract-archive show <id>` groups all of the contract's obligations by Party A / Party B / both,
and keeps them strictly separate from the original `risk_clauses` (breach penalties).
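
The "deadline ascending, NULLs last" ordering of the to-do list is easy to get wrong in SQLite, where NULL sorts first by default. The sketch below shows one way to express it. The table layout is a hypothetical stand-in: only the `obligations` table name and its `actor` and `deadline` columns come from the README.

```python
import sqlite3
from typing import Optional


def make_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the obligations table (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE obligations ("
        "id INTEGER PRIMARY KEY, actor TEXT, action TEXT, deadline TEXT)"
    )
    conn.executemany(
        "INSERT INTO obligations (actor, action, deadline) VALUES (?, ?, ?)",
        [
            ("party_a", "pay deposit", "2026-03-01"),
            ("party_b", "hand over keys", None),
            ("party_b", "sign annex", "2026-01-15"),
        ],
    )
    return conn


def list_todos(
    conn: sqlite3.Connection,
    actor: Optional[str] = None,
    before: Optional[str] = None,
    include_undated: bool = False,
) -> list[tuple]:
    clauses: list[str] = []
    params: list[str] = []
    if actor is not None:
        clauses.append("actor = ?")
        params.append(actor)
    if before is not None:
        # A NULL deadline never satisfies this comparison, so undated rows drop out.
        clauses.append("deadline < ?")
        params.append(before)
    elif not include_undated:
        clauses.append("deadline IS NOT NULL")
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    # `deadline IS NULL` evaluates to 0 or 1, so dated rows sort ahead of undated ones;
    # ISO dates stored as text then sort chronologically.
    sql = (
        "SELECT action, actor, deadline FROM obligations"
        f"{where} ORDER BY deadline IS NULL, deadline"
    )
    return conn.execute(sql, params).fetchall()
```

Storing deadlines as ISO strings is what makes plain text ordering match calendar ordering here.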
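
### Identity verification (known_parties baseline registry)

During extraction, each subject (natural person or organization) is **bound precisely to its own** intrinsic identifiers (ID-card number, phone, bank account, opening bank, tax ID, ...)
in `person_identities`, instead of mixing several people's numbers into one flat field.
At ingest, these are compared against the `known_parties` baseline registry on a "record on first sight, verify on later sight" basis:

- **First sight**: the first time a given identifier of a subject appears, it is recorded as the baseline (along with where it was first seen).
- **Later sight**: when the same identifier of the same subject appears again, it is compared with the baseline; a mismatch is reported in the "identity verification" block of `show`
  as an `identity` defect (a suspected OCR misread or altered information), and **the baseline is not overwritten**.
- Before comparison, values are normalized to strip separator noise (spaces, `;`, `:` cause no false alarms), while real digit differences (extra, missing, or transposed digits) are still caught.
- Natural persons and organizations are treated alike: ID cards, phones, bank accounts, and opening banks are all verified.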
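
The rules above can be condensed into a few lines. This is an illustrative sketch, not the package's `party_registry.py`: the class and function names are hypothetical, and the separator set is limited to the characters the list mentions (whitespace plus `;` and `:` in ASCII and full-width form).

```python
import re
from dataclasses import dataclass, field

# Separator noise to ignore: any whitespace, plus ASCII and full-width ; and :
_SEPARATORS = re.compile(r"[\s;；:：]+")


def normalize_identifier(value: str) -> str:
    """Strip separator noise so '138 0013 8000' and '13800138000' compare equal."""
    return _SEPARATORS.sub("", value)


@dataclass
class BaselineRegistry:
    """Maps (subject, identifier kind) to the normalized first-seen value."""

    baseline: dict[tuple[str, str], str] = field(default_factory=dict)

    def observe(self, subject: str, kind: str, value: str) -> str:
        """Return 'recorded' on first sight, else 'match' or 'mismatch'."""
        key = (subject, kind)
        seen = normalize_identifier(value)
        if key not in self.baseline:
            self.baseline[key] = seen
            return "recorded"
        if self.baseline[key] == seen:
            return "match"
        # The baseline is deliberately left untouched on a mismatch.
        return "mismatch"
```

Leaving the baseline alone on a mismatch means one bad OCR read cannot silently replace a good value; a manual correction remains the only way to change it.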

```bash
contract-archive party list                        # list all known subjects and their identifiers
contract-archive party show 张三                    # view one subject's baseline identifiers
contract-archive party set 张三 身份证号 1101...      # manually correct the baseline (fix a first-seen value that OCR misread)
contract-archive party rm 张三 电话                  # delete one identifier; omit the identifier to delete the whole subject
```

> The baseline registry `known_parties.json` lives in the archive root and **contains real PII** (ID cards, phones, accounts).
> Its file mode is 0600 and it is listed in `.gitignore`; never commit or share it.

### Managing the extraction layer

Re-run extraction in bulk after an LLM failure or a prompt upgrade, without re-running MinerU:

```bash
uv run contract-archive extract 5             # re-run extraction for id=5
uv run contract-archive extract 5 --no-llm    # skip the LLM (extracted fields stay empty; rule extraction is retired)
```

### Statistics and maintenance

```bash
uv run contract-archive stats                       # totals / status distribution / signings by month / expiring within 30 days
uv run contract-archive delete 5                    # by default deletes only the DB row, with interactive confirmation
uv run contract-archive delete 5 --purge-files -y   # also deletes archive/documents/<sha>/, without confirmation
uv run contract-archive vacuum                      # defragment after a large batch of ingests
```

> **Note**: `delete` never deletes the user's original PDF. The `source_path` field records the source path
> at ingest time, and the source file belongs to the user.

### Seal overview

```bash
uv run contract-archive seals                           # list all seals across documents
uv run contract-archive seals --seal-owner 示例公司      # seals of one subject (--owner is a synonym)
uv run contract-archive seals --seal-type 合同专用章      # filter by seal type (--type is a synonym)
```

### Machine discovery / agent integration

When wrapping this CLI as an MCP / OpenAI tool, or letting an agent call it automatically, these commands avoid hard-coding; all of them output JSON:

```bash
uv run contract-archive capabilities      # all commands + side-effect / destructive / idempotent metadata
uv run contract-archive describe ingest   # parameter schema for one command (name/type/required/default/choices)
uv run contract-archive schema document   # JSON Schema of the core data structures (document/contract/confidence/error)
```

The data commands (list/search/show/stats/todo/seals/party/extract/ingest) all support `--format json`,
with a clean stdout that can be piped to `jq`; failed results carry a structured `error` (`code`/`category`/`retryable`) so an agent can decide whether to retry.
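
On the agent side, the structured `error` could drive a retry decision along these lines. This is only a sketch: the README names the `code`, `category` and `retryable` fields, but the surrounding payload shape (a top-level `"error"` object) and the sample code value are assumptions.

```python
import json
from typing import Any


def should_retry(stdout_text: str, attempt: int, max_attempts: int = 3) -> bool:
    """Decide whether to re-run a command based on its JSON output.

    Assumes failures look like {"error": {"code": ..., "category": ..., "retryable": ...}}.
    """
    result: Any = json.loads(stdout_text)
    if not isinstance(result, dict):
        return False  # e.g. a plain list of rows: success, nothing to retry
    error = result.get("error")
    if not error:
        return False
    return bool(error.get("retryable")) and attempt < max_attempts
```

A caller would typically capture stdout from `subprocess.run([...], capture_output=True, text=True)` and back off between attempts.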

## ✦ Archive directory layout

```
archive/
├── db.sqlite                         # index tables
├── db.sqlite-wal / -shm              # WAL-mode artifacts (runtime)
├── ingest.jsonl                      # global log (one JSON line per ingest)
└── documents/
    └── a3f9c2b1/                     # sha256 prefix (first 12 hex characters; the example name is shortened)
        ├── source.pdf                # hard link to the source PDF (falls back to a copy across filesystems)
        ├── mineru/
        │   ├── markdown.md
        │   ├── layout.json           # bbox already normalized to PDF points
        │   ├── structured.json
        │   ├── raw_text.txt
        │   ├── pipeline_meta.json
        │   └── preview_images/
        ├── extraction_result.json    # extracted fields (generic envelope)
        ├── extraction_confidence.json
        └── ingest.log                # per-document stderr
```
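
The `source.pdf` entry, a hard link that falls back to a copy across filesystems, can be sketched as below. The function name `link_or_copy` is an assumption for illustration and is not taken from the package's `paths.py`.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(source: Path, target: Path) -> str:
    """Hard-link source to target, copying instead when linking is not possible.

    Returns 'hardlink' or 'copy' so the caller can log which path was taken.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(source, target)
        return "hardlink"
    except FileExistsError:
        raise  # never silently overwrite an existing archive entry
    except OSError:
        # Typically EXDEV (different filesystem) or a filesystem without hard links.
        shutil.copy2(source, target)
        return "copy"
```

A hard link costs no extra disk space and keeps the archived copy alive even if the original path is later removed, while the copy fallback keeps ingest working when the archive sits on another volume.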

## ✦ Docker

```bash
docker build -t contract-archive -f docker/Dockerfile .
docker run --rm -it \
  -v "$PWD/archive:/app/archive" \
  -v "$PWD/input:/app/input" \
  -v ~/.cache/modelscope:/root/.cache/modelscope \
  --env-file .env \
  contract-archive uv run contract-archive ingest /app/input
```

Mounting the modelscope cache reuses the MinerU models already on the host. Containers on a Mac get no GPU passthrough, so running in a native venv is strongly recommended.

## ✦ Project layout

```
contract-archive-cli/
├── pyproject.toml                  # uv dependency management (extras: mineru)
├── docker/Dockerfile
├── .env.example
├── scripts/
│   └── setup.sh
├── contract_archive/
│   ├── cli.py                      # entry point main_entry + write commands ingest/extract/delete/vacuum + assembly
│   ├── cli_common.py               # app instance + global callback + parameter Enums + dual consoles + path/ident resolution
│   ├── cli_query.py                # read-only commands list/search/show/raw/stats/todo/seals
│   ├── cli_config.py               # config show/set/unset subcommand group
│   ├── cli_party.py                # party list/show/set/rm (known_parties PII baseline registry)
│   ├── cli_introspect.py           # capabilities/describe/schema machine-discovery commands
│   ├── cli_render.py               # pure rendering layer (Table / JSON dict / raw highlighting)
│   ├── schemas/                    # pydantic schemas (BBox/LayoutBlock/DocumentExtraction, etc.)
│   ├── pipelines/
│   │   └── mineru_pipeline.py      # MinerU subprocess invocation + coordinate normalization + markdown cleanup
│   ├── extraction/                 # pure LLM extraction (rule/hybrid retired since Phase 2)
│   │   ├── document_extractor.py   # generic document classification + envelope extraction
│   │   ├── contract_extractor.py   # contract-specific fields (dedicated prompt)
│   │   ├── llm_extractor.py        # calls DashScope's OpenAI-compatible endpoint
│   │   ├── vision_seal.py          # VL seal and signature check on the signature page
│   │   └── normalize.py / amount_check.py / evidence_page_fix.py / property_fee.py
│   ├── archive/
│   │   ├── db.py                   # SQLite connection + migrations engine
│   │   ├── repository.py           # DAO + search query construction
│   │   ├── ingest.py               # ingest pipeline (hash → MinerU → extract → rename → DB)
│   │   ├── party_registry.py       # known_parties identity baseline registry
│   │   ├── paths.py                # archive path conventions + hard-link helpers
│   │   └── migrations/             # 001_init … 005_completeness (5 files)
│   ├── errors.py                   # structured error model (code/category/retryable)
│   ├── config.py                   # XDG config + env > file > default
│   └── utils/                      # device selection / PyMuPDF PDF rendering
├── archive/                        # archive data (gitignored)
├── input/                          # PDFs the user drops in for processing
└── output.legacy/                  # historical output of the old pipeline (pre-refactor comparison data, safe to delete)
```
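
## ✦ Design principles

- **Unified schema**: MinerU's 0-1000 normalized coordinates are all converted back to PDF points; markdown backslash escapes are cleaned before the text is fed to the extraction layer
- **Pure LLM extraction**: rule/hybrid modes were retired in Phase 2 and every field comes from the LLM; each field carries a `value_source` (only `llm`/`missing`) plus a confidence score. Hard-coded rules survive only as deterministic value normalization (Chinese uppercase amounts → numbers, dates → ISO)
- **API keys never ship in the package**: they are read only at runtime from the environment (or the user's config file), and response bodies are never logged
- **sha256 dedup**: the file is hashed in streaming fashion and looked up in a UNIQUE index; a hit is skipped immediately, which avoids running MinerU for several minutes only to discover a duplicate
- **Transaction boundary**: run everything in a tmp directory → `os.rename` into documents/ → DB INSERT; a failure at any stage rolls back cleanly and leaves no half-finished record in the DB
- **partial status is repairable**: when MinerU succeeds but the LLM fails, the markdown is still usable, and the `extract <id>` command re-runs only the extraction layer
- **No concurrent MinerU**: each subprocess loads GB-scale models, so concurrency only leads to OOM; default workers=1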
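
The transaction boundary in the list above (tmp directory → `os.rename` → DB INSERT) can be sketched as follows. This is a minimal illustration under assumptions, not the package's `ingest.py`: the function name, the `produce` callback, and the `documents` table with its `sha256` and `source_path` columns are hypothetical.

```python
import os
import shutil
import sqlite3
import tempfile
from pathlib import Path
from typing import Callable


def ingest_atomically(
    conn: sqlite3.Connection,
    archive_root: Path,
    sha256_hex: str,
    source_path: Path,
    produce: Callable[[Path], None],
) -> Path:
    """Build all artifacts in a tmp dir, publish with one rename, then index."""
    documents = archive_root / "documents"
    documents.mkdir(parents=True, exist_ok=True)
    final_dir = documents / sha256_hex[:12]
    # Same parent directory, hence same filesystem: the rename below is atomic.
    tmp_dir = Path(tempfile.mkdtemp(prefix=".tmp-", dir=documents))
    try:
        produce(tmp_dir)               # parsing + extraction write their files here
        os.rename(tmp_dir, final_dir)  # publish the finished directory in one step
    except BaseException:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    try:
        with conn:  # commits on success, rolls back if the INSERT raises
            conn.execute(
                "INSERT INTO documents (sha256, source_path) VALUES (?, ?)",
                (sha256_hex, str(source_path)),
            )
    except BaseException:
        shutil.rmtree(final_dir, ignore_errors=True)  # undo the publish step
        raise
    return final_dir
```

Ordering the INSERT last means a row in the index always points at a complete directory; the reverse failure (directory without row) is cleaned up by the second `except` branch.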
contract_archive_cli-0.2.7.dist-info/RECORD
@@ -0,0 +1,44 @@
contract_archive/__init__.py,sha256=N9b0h5EVwXMhVp10Ayyaz9oB0A9tWQWBK1ydtwxvA74,113
contract_archive/cli.py,sha256=azTCcWNvScmE7m1F4wmF0ZvI76xYLT8Z_QVeOd7Ymj4,18376
contract_archive/cli_common.py,sha256=mOAVoGJmZdv2JWORGsS7oJ4PPeaKhn86kJ-Kt7CzcVw,10417
contract_archive/cli_config.py,sha256=KKIITBQ1qlwoxgMDtIul0Rc7SJYl-9-k21fQx08dPOw,3855
contract_archive/cli_introspect.py,sha256=fJspKQN5LcIzkDP7FTlAus5j9dvKnN_aVL99V7ZAkiQ,9475
contract_archive/cli_party.py,sha256=EzzNiI6KggqYNKlaOhjzicTV3wHRmPD_OLaMmZT6MeY,7582
contract_archive/cli_query.py,sha256=d9PcmisRfeAa91HEm4rfWbvbFiReFoNZv96FNt9Hr0w,17248
contract_archive/cli_render.py,sha256=zEP7ql3sPfp8YGJ5CLcQKrtWVBSLQUyFEkRJpMX3f5c,23530
contract_archive/config.py,sha256=l1ljfo57Uym-8jQXxOblJugbqGtAEOYZYOTUp2Ym6pw,10988
contract_archive/errors.py,sha256=bGkBlNFicM7G0o8yCrYYRzH1I_pzRhZhCM2MuPOUAPM,6981
contract_archive/archive/__init__.py,sha256=1jST49wGbPKC59Jzd4D4H1GCwLjzLm4W8YAiZGklxqc,1561
contract_archive/archive/db.py,sha256=Hk1X4iOhYcvwJlA8MEvxVqDgnO0s_vExjAek-9s6Y8I,4338
contract_archive/archive/ingest.py,sha256=nYgyRFMfFzO6AxK4c1kQRY13B86oJ6hAbcXbF39vmtA,25729
contract_archive/archive/party_registry.py,sha256=PEuVuFafaQK8jT__Aq9GEYRuQR4ehwK7UpMzVQQbxX0,13432
contract_archive/archive/paths.py,sha256=UB8sET1NDs_jWR2XLlyhxYWbwXB9e7nmdi-HbuFmo-I,3637
contract_archive/archive/repository.py,sha256=3TwsAe3A9pNr4WMnZOlFg9dfFVZdqXmUtXbMX9If53Y,31323
contract_archive/archive/migrations/001_init.sql,sha256=aQYV7uFygcISZaOxv-ZGTwSOSHebcfe5mLUyb7dYz8I,3036
contract_archive/archive/migrations/002_obligations.sql,sha256=ZaKaVSWIAIxJAsM4MCJcXip8YLZYpjffDkvEfI4vGHU,1187
contract_archive/archive/migrations/003_document_types.sql,sha256=1laoz1dLg14vzyZvLGt_QPamWgGHb2cVo_HID5R-9u4,2117
contract_archive/archive/migrations/004_seals_subjects.sql,sha256=DmF1Pg_u_E8wbkx2B0fE2kl69-VqKJGcbykMZqhWUUg,2164
contract_archive/archive/migrations/005_completeness.sql,sha256=4Czsjdr3ttfjAWyUPkIaDLpg_vN2SQSeJGbe_uAjOiw,1228
contract_archive/extraction/__init__.py,sha256=oh4rj1vnH6CvZvb0_dz-P9mrTekxEq3HYMGDtWbMMlI,423
contract_archive/extraction/amount_check.py,sha256=G-BYm_WN9VjaPBXEVe-u5LsGpRebPZLMz0B27UrOYu0,4087
contract_archive/extraction/contract_extractor.py,sha256=iJhI-3m0tdDd6Ne_UQ4PnujIizN74qlD16-JphGvX1M,3859
contract_archive/extraction/document_extractor.py,sha256=dld3mCYUVLOtc0rBx1vVKRrEFMYgs8KKWytRMlP3xEY,33051
contract_archive/extraction/evidence_page_fix.py,sha256=QqOSSWuk_H5SMc3MfjgQKLLvcXp2QPPOlIdPjsAf6DM,4093
contract_archive/extraction/llm_extractor.py,sha256=Rfr6fe6K7dkuN1mlKLc908ZDt8mmzuxIJmet659evFc,9463
contract_archive/extraction/normalize.py,sha256=ep3-ZeqG6PZo3rZFAspYsk9GyYLyyi7x5LhIdDbHCjc,7255
contract_archive/extraction/property_fee.py,sha256=NSi18i7EdERkOfdJvAWwEvRfNDd7PG14wTT26SN2Xz0,3599
contract_archive/extraction/vision_seal.py,sha256=e_N-Zd5O5Kq0ZJrlyGk6cYT_nQ5AKr0DSQCqIL2CUQg,18006
contract_archive/pipelines/__init__.py,sha256=c-rjsQs11CcmQbyZgXIhREnluO8jjQJJXkACf-JlkiY,273
contract_archive/pipelines/mineru_pipeline.py,sha256=B3aZBSOCsHgZdEsnMjLB7lUsQ5G9580Q9fwiuIF1RGE,33061
contract_archive/pipelines/vl_ocr.py,sha256=Uu5EnJk5Hq8B-mgiQ6ydOWirU9kY9_KXwOdCxjeE1MM,6675
contract_archive/schemas/__init__.py,sha256=5AK6fWlHWbMdOIWI5hp93Wy03HO_zmMHbjagQ4HiSe8,1275
contract_archive/schemas/document.py,sha256=ujKsOxhZJtF19iIQY6Ti2qUKFhjPxckAlkPX2jGSAec,18737
contract_archive/utils/__init__.py,sha256=T9s-MpY3BNQ0JYTNSP6yJfbCySqyA-BRW8XlsUV40L8,561
contract_archive/utils/device.py,sha256=IhI_MMm3UjE0NPUlLCrv6n0MuRt3QoolpZLeMCCmw-w,1463
contract_archive/utils/http_env.py,sha256=3FJRmAvf6FyevpfEazveC_5VQ3qCcj1vWT6_lMW7XPs,1685
contract_archive/utils/pdf.py,sha256=wbMs3ZYO1D5XNntcR0pqkvaAI0IKa87UZNsLi6gjzaA,6503
contract_archive_cli-0.2.7.dist-info/METADATA,sha256=JXKrrOpJzIjVKlotSUP7qhgkNOKEVbPPD1-Affae2UE,18855
contract_archive_cli-0.2.7.dist-info/WHEEL,sha256=mffPy8wBnZQn2VnJUU5jE99KsxaSfiyMHV9Yt0aLVxs,87
contract_archive_cli-0.2.7.dist-info/entry_points.txt,sha256=FaCRhque-fIEWQ3yHAVEoceroCL0ZB2vPKfxyI5L0M4,69
contract_archive_cli-0.2.7.dist-info/licenses/LICENSE,sha256=scC-b_caxbJBCkr0ioJkMF_8fTKUNDEcR463CD0TWhQ,1062
contract_archive_cli-0.2.7.dist-info/RECORD,,
contract_archive_cli-0.2.7.dist-info/licenses/LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 crhan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.