contract-archive-cli 0.2.7__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (44)
  1. contract_archive/__init__.py +2 -0
  2. contract_archive/archive/__init__.py +64 -0
  3. contract_archive/archive/db.py +126 -0
  4. contract_archive/archive/ingest.py +667 -0
  5. contract_archive/archive/migrations/001_init.sql +62 -0
  6. contract_archive/archive/migrations/002_obligations.sql +25 -0
  7. contract_archive/archive/migrations/003_document_types.sql +31 -0
  8. contract_archive/archive/migrations/004_seals_subjects.sql +36 -0
  9. contract_archive/archive/migrations/005_completeness.sql +18 -0
  10. contract_archive/archive/party_registry.py +276 -0
  11. contract_archive/archive/paths.py +113 -0
  12. contract_archive/archive/repository.py +918 -0
  13. contract_archive/cli.py +455 -0
  14. contract_archive/cli_common.py +293 -0
  15. contract_archive/cli_config.py +96 -0
  16. contract_archive/cli_introspect.py +204 -0
  17. contract_archive/cli_party.py +166 -0
  18. contract_archive/cli_query.py +492 -0
  19. contract_archive/cli_render.py +575 -0
  20. contract_archive/config.py +257 -0
  21. contract_archive/errors.py +163 -0
  22. contract_archive/extraction/__init__.py +14 -0
  23. contract_archive/extraction/amount_check.py +87 -0
  24. contract_archive/extraction/contract_extractor.py +103 -0
  25. contract_archive/extraction/document_extractor.py +546 -0
  26. contract_archive/extraction/evidence_page_fix.py +99 -0
  27. contract_archive/extraction/llm_extractor.py +207 -0
  28. contract_archive/extraction/normalize.py +210 -0
  29. contract_archive/extraction/property_fee.py +79 -0
  30. contract_archive/extraction/vision_seal.py +390 -0
  31. contract_archive/pipelines/__init__.py +9 -0
  32. contract_archive/pipelines/mineru_pipeline.py +955 -0
  33. contract_archive/pipelines/vl_ocr.py +160 -0
  34. contract_archive/schemas/__init__.py +67 -0
  35. contract_archive/schemas/document.py +408 -0
  36. contract_archive/utils/__init__.py +27 -0
  37. contract_archive/utils/device.py +51 -0
  38. contract_archive/utils/http_env.py +54 -0
  39. contract_archive/utils/pdf.py +207 -0
  40. contract_archive_cli-0.2.7.dist-info/METADATA +386 -0
  41. contract_archive_cli-0.2.7.dist-info/RECORD +44 -0
  42. contract_archive_cli-0.2.7.dist-info/WHEEL +4 -0
  43. contract_archive_cli-0.2.7.dist-info/entry_points.txt +2 -0
  44. contract_archive_cli-0.2.7.dist-info/licenses/LICENSE +21 -0
@@ -0,0 +1,386 @@
1
+ Metadata-Version: 2.4
2
+ Name: contract-archive-cli
3
+ Version: 0.2.7
4
+ Summary: Local document archive CLI: OCR parsing + qwen3.7-max field extraction + SQLite indexing (contracts, certificates, invoices, etc.)
5
+ Author: crhan
6
+ License: MIT
7
+ License-File: LICENSE
8
+ Requires-Python: <3.13,>=3.10
9
+ Requires-Dist: click<9,>=8.0
10
+ Requires-Dist: dashscope>=1.22.2
11
+ Requires-Dist: openai>=1.40
12
+ Requires-Dist: pillow>=10.0
13
+ Requires-Dist: pydantic>=2.6
14
+ Requires-Dist: pymupdf>=1.24
15
+ Requires-Dist: python-dotenv>=1.0
16
+ Requires-Dist: rich>=13.7
17
+ Requires-Dist: socksio>=1.0
18
+ Requires-Dist: tenacity>=8.2
19
+ Requires-Dist: typer<0.26,>=0.12
20
+ Provides-Extra: dev
21
+ Requires-Dist: ipython>=8.0; extra == 'dev'
22
+ Requires-Dist: pytest-cov>=5.0; extra == 'dev'
23
+ Requires-Dist: pytest>=8.0; extra == 'dev'
24
+ Requires-Dist: ruff>=0.5; extra == 'dev'
25
+ Provides-Extra: mineru
26
+ Requires-Dist: mineru[core]<4.0,>=3.1; extra == 'mineru'
27
+ Requires-Dist: scipy>=1.10; extra == 'mineru'
28
+ Description-Content-Type: text/markdown
29
+
30
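+ # Local Document Archive CLI
31
+
32
+ > Batch-ingest, archive, and trace all kinds of PDF documents: contracts, agreements, certificates, invoices, reports, and more.
33
+ > MinerU parses the page layout and text, the qwen3.7-max LLM classifies the document type and extracts fields, and everything is indexed in a local SQLite database
34
+ > that supports retrieval filtered by type and field.
35
+
36
+ **LLM-first**: both document classification and field extraction are delegated to the LLM, and the results are normalized into a single "generic envelope"
37
+ (doc_type / title / summary / subjects / amounts / dates / flexible fields). Adding a new document type
38
+ **requires no code**: the LLM decides what to extract. Rule-based code survives only as deterministic value normalization
39
+ (Chinese uppercase amounts → numbers, dates → ISO). Contracts additionally get a tuned, dedicated prompt (still pure LLM),
40
+ which keeps every contract field and query (Party A/Party B, expiry date, auto-renewal, risk clauses, obligation list).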
41
+
42
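+ History: this project started as a playground comparing three OCR paths (DashScope / PaddleOCR / MinerU).
43
+ After MinerU was chosen it was refactored into an archive CLI, then extended from "contracts only" to a "general document archive".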
44
+
45
+ ## ✦ Data flow
46
+
47
+ ```
48
+ PDF ─► sha256 dedup ─► MinerU parsing ─► LLM classification + field extraction (contracts use a dedicated prompt, pure LLM)
49
+
50
+               ┌───────────────────────────┴──┐
51
+               ▼                              ▼
52
+   db.sqlite (generic envelope + index)   documents/<sha-12>/
53
+                                          ├── source.pdf (hard link)
54
+                                          ├── mineru/markdown.md ...
55
+                                          ├── extraction_result.json (generic envelope)
56
+                                          └── ingest.log
57
+
58
+ The archive lives in the XDG data directory by default: ~/.local/share/contract-archive/
59
+ ```
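The first stage of the flow above (hash the PDF, skip it if the digest is already known) can be sketched as follows. This is a minimal illustration under stated assumptions: the `documents` table, its `sha256` column, and both function names are hypothetical and are not taken from the package's migrations.

```python
import hashlib
import sqlite3


def sha256_stream(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large PDFs never sit fully in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def already_ingested(conn: sqlite3.Connection, digest: str) -> bool:
    """Look the digest up before any expensive parsing is started."""
    # Assumed schema: documents(sha256 TEXT UNIQUE, ...)
    row = conn.execute(
        "SELECT 1 FROM documents WHERE sha256 = ?", (digest,)
    ).fetchone()
    return row is not None


def document_dir_name(digest: str) -> str:
    """Directory name under documents/: the first 12 hex characters."""
    return digest[:12]
```

Checking the index before parsing is what makes the dedup cheap: a duplicate costs one file read and one indexed lookup instead of a full parsing run.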
60
+
61
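+ ## ✦ Installation
62
+
63
+ ```bash
64
+ # 1) Install uv
65
+ curl -LsSf https://astral.sh/uv/install.sh | sh
66
+
67
+ # 2) Install dependencies (the mineru extra pulls in the MinerU package; the first ingest run also downloads >1 GB of models from modelscope)
68
+ ./scripts/setup.sh mineru
69
+ ```
70
+
71
+ > **uv hardlink pitfall**: with uv's default `UV_LINK_MODE=hardlink`, a package is occasionally installed with only some of its files
72
+ > (observed with `cv2`/`pptx`, which then fail with `module 'cv2' has no attribute 'INTER_NEAREST'`
73
+ > or `cannot import name 'Presentation' from 'pptx'`). `scripts/setup.sh` already
74
+ > runs an explicit `export UV_LINK_MODE=copy` to avoid this. Setting it is also recommended when running `uv sync` by hand.
75
+ > A damaged package can be repaired with `uv pip install --force-reinstall --no-deps <package>`.
76
+
77
+ If you only want to query an existing archive with list/search/show (a db.sqlite that someone else has already ingested on this machine),
78
+ you can skip the mineru extra:
79
+
80
+ ```bash
81
+ ./scripts/setup.sh base
82
+ ```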
83
+
84
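+ ## ✦ Global installation (optional)
85
+
86
+ To use `contract-archive` from any directory (without `cd`-ing into the project or using `uv run`), use `uv tool install`:
87
+
88
+ ```bash
89
+ # Note the ".[mineru]" form (the project's mineru extra = mineru[core], which includes torch and other model dependencies).
90
+ # Do not use `--with mineru`: bare mineru lacks [core], so the installed tool fails on ingest with
91
+ # ModuleNotFoundError: No module named 'torch'.
92
+ # Use --reinstall rather than --force: when the version number is unchanged, --force hits the old wheel in uv's cache
93
+ # and installs stale code (observed to stay on the old version); --reinstall forces a rebuild, so updates are reliable.
94
+ UV_LINK_MODE=copy uv tool install --reinstall "/path/to/contract-archive-cli[mineru]"
95
+ ```
96
+
97
+ `uv tool install` sets up an isolated venv (separate from the project venv) and exposes the command at `~/.local/bin/contract-archive`.
98
+ Then, from any directory:
99
+
100
+ ```bash
101
+ # Select the archive with an environment variable
102
+ CONTRACT_ARCHIVE_DIR=~/contracts contract-archive list
103
+
104
+ # Or pass --archive explicitly (a per-command option, placed after the subcommand)
105
+ contract-archive list --archive ~/contracts
106
+ contract-archive ingest ~/Documents/new_contract.pdf --archive ~/contracts
107
+ ```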
108
+
109
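+ `DASHSCOPE_API_KEY` must be supplied through the shell environment (putting it in `~/.zshrc` or a dedicated shell wrapper is recommended).
110
+
111
+ **Developers (source changes should take effect immediately)**: add `--editable` so the global command points at this repository's source instead of a snapshot.
112
+ Edits to `.py` files then take effect directly, with no reinstall each time:
113
+
114
+ ```bash
115
+ UV_LINK_MODE=copy uv tool install --editable --reinstall "/path/to/contract-archive-cli[mineru]"
116
+ ```
117
+
118
+ > Without `--editable`, what gets installed is a snapshot of the code at that moment: after later source changes or a version bump in the repository, you must
119
+ > run `--reinstall` again to update. If `contract-archive --version` does not match the repository's `pyproject.toml`,
120
+ > an old snapshot is most likely installed; reinstalling fixes it.
121
+
122
+ Uninstall:
123
+
124
+ ```bash
125
+ uv tool uninstall contract-archive-cli
126
+ # Data and configuration are not removed with it and must be cleaned up manually:
127
+ # ~/.local/share/contract-archive   archive data (db.sqlite + documents/)
128
+ # ~/.config/contract-archive        config.json
129
+ ```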
130
+
131
+ ## ✦ Configuration
132
+
133
+ There are two ways to configure, with precedence **environment variables (including .env) > config file > defaults**:
134
+
135
+ ```bash
136
+ # Option 1: the config command (writes ~/.config/contract-archive/config.json with mode 0600, safer than a project .env)
137
+ contract-archive config set dashscope.api_key sk-xxx
138
+ contract-archive config show # show each setting's effective value and its source (secrets are masked by default)
139
+ contract-archive config show --format json # machine-readable: includes key/env/secret/default/value/source
140
+ contract-archive config unset dashscope.api_key
141
+
142
+ # Option 2: a project .env (the older way, still supported)
143
+ cp .env.example .env
144
+ $EDITOR .env # fill in DASHSCOPE_API_KEY
145
+ ```
146
+
147
+ | Environment variable | config key | Description |
148
+ | --- | --- | --- |
149
+ | `DASHSCOPE_API_KEY` | `dashscope.api_key` | Required. Request one in the [Bailian console](https://dashscope.console.aliyun.com/) |
150
+ | `DASHSCOPE_LLM_MODEL` | `dashscope.model` | Defaults to `qwen3.7-max` (an alias specific to the user's Bailian account; on a 404, switch to `qwen-max` / `qwen3-max`) |
151
+ | `DASHSCOPE_BASE_URL` | `dashscope.base_url` | Defaults to `https://dashscope.aliyuncs.com/api/v1`; outside mainland China use `https://dashscope-intl.aliyuncs.com/api/v1` |
152
+ | `DASHSCOPE_VL_MODEL` | `dashscope.vl_model` | Vision model for seal and signature verification; defaults to `qwen3.6-flash` |
153
+ | `CONTRACT_ARCHIVE_DIR` | `archive.dir` | Archive root directory; defaults to the XDG path `~/.local/share/contract-archive`; the CLI `--archive` option takes precedence |
154
+ | `COMPUTE_DEVICE` | - | `auto` / `mps` / `cuda` / `cpu` (MinerU runs in a subprocess, so this mainly affects its internal backend selection) |
155
+ | `LOG_LEVEL` | - | `DEBUG`/`INFO`/`WARNING`/..., defaults to `INFO`; `--verbose`/`--quiet` override it |
156
+ | `DASHSCOPE_TIMEOUT_S` | - | Timeout in seconds for LLM/VL calls; defaults to `300` |
157
+ | `CONTRACT_ARCHIVE_MINERU_TIMEOUT_S` | - | Timeout in seconds for the MinerU parsing subprocess; defaults to `1800` |
158
+ | `MINERU_MODEL_SOURCE` | - | MinerU model source; defaults to `modelscope` (fast inside mainland China); elsewhere you can export `huggingface` |
159
+
160
+ > Entries marked `-` are runtime knobs: they stay env-only and are not part of the config file layer.
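The precedence rule stated in this section (environment variable > config file > default) can be sketched as a small resolver. This is not the package's `config.py`: the function name `resolve_setting` is hypothetical, and the sketch assumes that `config.json` stores dotted keys such as `dashscope.model` as nested JSON objects, which the source does not confirm.

```python
import json
import os
from pathlib import Path
from typing import Any, Mapping, Optional, Tuple


def resolve_setting(
    env_name: str,
    config_key: str,
    default: Any,
    config_path: str,
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[Any, str]:
    """Return (value, source) where source is 'env', 'file', or 'default'."""
    environ = os.environ if environ is None else environ

    # 1) A non-empty environment variable always wins.
    if environ.get(env_name):
        return environ[env_name], "env"

    # 2) Otherwise walk the dotted key through the JSON config file.
    path = Path(config_path)
    if path.is_file():
        node: Any = json.loads(path.read_text(encoding="utf-8"))
        for part in config_key.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node is not None:
            return node, "file"

    # 3) Fall back to the built-in default.
    return default, "default"
```

Returning the source next to the value mirrors what `config show` reports (the effective value and where it came from).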
161
+
162
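+ ## ✦ Usage
163
+
164
+ ```bash
165
+ # Ingest a single PDF
166
+ uv run contract-archive ingest path/to/合同.pdf
167
+
168
+ # Batch-ingest a whole directory (recursively scans *.pdf, deduplicated by sha256)
169
+ uv run contract-archive ingest ~/Documents/contracts/
170
+
171
+ # Skip the LLM (also works without an API key): store only the MinerU output and leave extracted fields empty; run extract later to fill them in
172
+ uv run contract-archive ingest path/to/合同.pdf --no-llm
173
+
174
+ # Force a re-run (reprocess documents that were already ingested, overwriting the old records)
175
+ uv run contract-archive ingest path/to/合同.pdf --reingest
176
+
177
+ # Trial run on the first 3 files
178
+ uv run contract-archive ingest ~/Documents/contracts/ --limit 3
179
+
180
+ # Cost / progress (agent-friendly)
181
+ uv run contract-archive ingest ~/Documents/contracts/ --dry-run # preview only: how many files were found and how many API calls are expected; creates no archive and spends nothing
182
+ uv run contract-archive ingest ~/Documents/contracts/ --max-files 20 # exit with an error if more than 20 files are found, guarding against feeding in a large directory by mistake
183
+ uv run contract-archive ingest ~/Documents/contracts/ --progress ndjson # one JSON event per line, one line per file, for agents to consume as a stream
184
+ ```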
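For the `--progress ndjson` mode above, a consumer only needs to parse one JSON object per line as the lines arrive. The sketch below shows that reading loop on its own. The event fields used in the sample (`file`, `status`) are invented for illustration, since the README does not document the event schema, and which output stream carries the events is likewise not stated.

```python
import io
import json
from typing import Iterator, TextIO


def iter_events(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed event per non-empty line of an NDJSON stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        yield json.loads(line)


# Hypothetical sample input; real event fields may differ.
sample = io.StringIO(
    '{"file": "a.pdf", "status": "ok"}\n'
    "\n"
    '{"file": "b.pdf", "status": "skipped"}\n'
)
for event in iter_events(sample):
    print(event["file"], event["status"])
```

Because the generator reads lazily, the same function works on a pipe from a running subprocess, so an agent can react to each file before the whole batch finishes.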
185
+
186
+ ### Querying
187
+
188
+ ```bash
189
+ # List everything (newest ingest first, 50 rows by default)
190
+ uv run contract-archive list
191
+
192
+ # Sort by signing date, showing only partial records
193
+ uv run contract-archive list --order-by sign_date --status partial
194
+
195
+ # Emit JSON for scripts to consume
196
+ uv run contract-archive list --format json | jq '.[] | .contract_name'
197
+
198
+ # Filter on multiple fields (all combined with AND)
199
+ uv run contract-archive search --party 张三 --amount-min 100000 --signed-after 2024-01-01
200
+ uv run contract-archive search --expire-before 2026-12-31 --has-risk
201
+ uv run contract-archive search --name 车位 --auto-renewal
202
+
203
+ # Show one record in detail (id, or a sha prefix of at least 4 characters)
204
+ uv run contract-archive show 5
205
+ uv run contract-archive show a3f9c2b1
206
+
207
+ # View the source text: show displays the LLM-extracted fields, raw displays the raw OCR text the extraction was based on (the same content that was fed to the LLM)
208
+ # In an interactive terminal, matched keywords are colored by extraction source (parties/amounts/dates/risks/fields), so you can see at a glance what was recognized
209
+ uv run contract-archive raw 5
210
+ uv run contract-archive raw a3f9c2b1 | grep 违约 # switches to plain text automatically when piped, so grep is not disrupted
211
+ uv run contract-archive raw 5 --color always | less -R # force color, paired with less -R
212
+ ```
213
+
214
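+ ### To-do board (obligation list)
215
+
216
+ During extraction, each contract's "actions" for both parties (submitting documents, paying, delivering, signing, and so on) are split out into
217
+ a separate `obligations` table; each row carries an `actor` (Party A / Party B / both) and a `deadline`:
218
+
219
+ ```bash
220
+ # List all to-dos across contracts (deadline ascending, NULLs last)
221
+ contract-archive todo --include-undated
222
+
223
+ # Things due within the next 30 days
224
+ contract-archive todo --within-days 30
225
+
226
+ # Only Party A's tasks / only Party B's tasks
227
+ contract-archive todo --actor party_a
228
+ contract-archive todo --actor party_b --before 2026-12-31
229
+
230
+ # Find "contracts with an action due within the next 30 days" (returns contract rows, not individual obligations)
231
+ contract-archive search --deadline-before 2026-06-30 --actor party_b
232
+ ```
233
+
234
+ `contract-archive show <id>` displays all of the contract's obligations grouped by Party A / Party B / both,
235
+ kept strictly separate from the existing `risk_clauses` (penalties for breach).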
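The "deadline ascending, NULLs last" ordering used by the to-do board can be expressed in SQLite by sorting on `deadline IS NULL` first. The sketch below is an illustration, not the package's `repository.py`: only the `actor` and `deadline` column names come from the text above, while the `action` column, the sample rows, and the `list_todos` helper are hypothetical.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obligations ("
    "id INTEGER PRIMARY KEY, actor TEXT, action TEXT, deadline TEXT)"
)
conn.executemany(
    "INSERT INTO obligations (actor, action, deadline) VALUES (?, ?, ?)",
    [
        ("party_a", "pay deposit", "2026-03-01"),
        ("party_b", "deliver keys", None),  # undated obligation
        ("party_b", "submit documents", "2026-01-15"),
    ],
)


def list_todos(
    conn: sqlite3.Connection,
    actor: Optional[str] = None,
    include_undated: bool = True,
) -> list:
    """Return (actor, action, deadline) rows, earliest deadline first."""
    sql = "SELECT actor, action, deadline FROM obligations"
    clauses, params = [], []
    if actor is not None:
        clauses.append("actor = ?")
        params.append(actor)
    if not include_undated:
        clauses.append("deadline IS NOT NULL")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    # 'deadline IS NULL' is 0 for dated rows and 1 for undated ones,
    # so dated rows sort first; ISO dates then sort correctly as text.
    sql += " ORDER BY deadline IS NULL, deadline ASC"
    return conn.execute(sql, params).fetchall()
```

Storing deadlines as ISO `YYYY-MM-DD` text is what lets a plain string sort double as a chronological sort.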
236
+
237
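+ ### Identity verification (known_parties baseline)
238
+
239
+ During extraction, each subject (natural person or organization) and its inherent identifiers (ID card number / phone / bank account / bank branch / tax ID ...)
240
+ are **bound precisely to the person** (`person_identities`), unlike flat fields that blend several people's numbers into one entry.
241
+ At ingest time they are compared against the `known_parties` baseline using a "record on first sight, verify on later sight" policy:
242
+
243
+ - **First sight**: the first time an identifier appears for a subject, it is recorded as the baseline (along with where it was first seen).
244
+ - **Later sight**: when the same identifier appears again for the same subject, it is compared with the baseline; a mismatch is reported as an
245
+ `identity` defect in the "identity verification" block of `show` (a suspected OCR misread or altered information), and **the baseline is not overwritten**.
246
+ - Before comparison, values are normalized to strip separator noise (spaces/;/: do not trigger false alarms), while real digit differences (extra, missing, or misplaced digits) are still caught.
247
+ - Natural persons and organizations are treated alike: ID card numbers, phones, bank accounts, and bank branches are all verified.
248
+
249
+ ```bash
250
+ contract-archive party list # list all known subjects and their identifiers
251
+ contract-archive party show 张三 # show one subject's identifier baseline
252
+ contract-archive party set 张三 身份证号 1101... # correct the baseline by hand (fixes a first-seen value that OCR misread)
253
+ contract-archive party rm 张三 电话 # delete one identifier; omit the identifier to delete the whole subject
254
+ ```
255
+
256
+ > The baseline file `known_parties.json` lives in the archive root and **contains real PII** (ID card numbers / phones / accounts).
257
+ > It has file mode 0600 and is listed in `.gitignore`; never commit or share it.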
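The "record on first sight, verify on later sight" policy described in this section can be sketched in a few lines. This is an illustrative model rather than the package's `party_registry.py`: the in-memory dict, the function names, and the shape of the returned defect are all assumptions, and the identifiers in the usage note are made-up values.

```python
import re
from typing import Dict, Optional, Tuple

# Separator noise to ignore: whitespace plus ASCII and full-width ; and :
_SEPARATORS = re.compile(r"[\s;;::]")

Baseline = Dict[Tuple[str, str], dict]


def normalize_identifier(value: str) -> str:
    """Strip separators so '1101 0119' and '11010119' compare equal."""
    return _SEPARATORS.sub("", value)


def check_identifier(
    baseline: Baseline, subject: str, kind: str, value: str, source: str
) -> Optional[dict]:
    """Record a first-seen value, or return a defect if a later one differs."""
    key = (subject, kind)
    if key not in baseline:
        # First sight: store the value and where it was first seen.
        baseline[key] = {"value": value, "first_seen": source}
        return None
    known = baseline[key]["value"]
    if normalize_identifier(known) != normalize_identifier(value):
        # Later sight with a real difference: report it, keep the baseline.
        return {
            "type": "identity",
            "subject": subject,
            "kind": kind,
            "baseline": known,
            "seen": value,
            "source": source,
        }
    return None
```

For example, after recording `"13800000000"` as a phone number, a later `"138 0000 0000"` passes silently, while `"13800000001"` produces a defect and leaves the stored value untouched.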
258
+
259
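+ ### Managing the extraction layer
260
+
261
+ Re-run extraction in bulk after an LLM failure or a prompt upgrade, without re-running MinerU:
262
+
263
+ ```bash
264
+ uv run contract-archive extract 5 # re-run extraction for id=5
265
+ uv run contract-archive extract 5 --no-llm # skip the LLM (extracted fields stay empty; rule extraction has been retired)
266
+ ```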
267
+
268
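+ ### Statistics and maintenance
269
+
270
+ ```bash
271
+ uv run contract-archive stats # totals / status distribution / signings by month / expiring within 30 days
272
+ uv run contract-archive delete 5 # deletes only the DB row by default, with an interactive confirmation
273
+ uv run contract-archive delete 5 --purge-files -y # also deletes archive/documents/<sha>/, without confirmation
274
+ uv run contract-archive vacuum # defragment after a large batch ingest
275
+ ```
276
+
277
+ > **Note**: `delete` never removes the user's original PDF. The `source_path` field only records the source path at
278
+ > ingest time, and the source file belongs to the user.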
279
+
280
+ ### Seal overview
281
+
282
+ ```bash
283
+ uv run contract-archive seals # list all seals across documents
284
+ uv run contract-archive seals --seal-owner 示例公司 # seals belonging to one subject (--owner is a synonym)
285
+ uv run contract-archive seals --seal-type 合同专用章 # filter by seal type (--type is a synonym)
286
+ ```
287
+
288
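+ ### Machine discovery / agent integration
289
+
290
+ When wrapping this CLI as an MCP / OpenAI tool, or letting an agent call it automatically, these commands avoid hard-coding; all of them output JSON:
291
+
292
+ ```bash
293
+ uv run contract-archive capabilities # all commands + side-effect/destructive/idempotent metadata
294
+ uv run contract-archive describe ingest # parameter schema for one command (name/type/required/default/choices)
295
+ uv run contract-archive schema document # JSON Schema of the core data structures (document/contract/confidence/error)
296
+ ```
297
+
298
+ The data commands (list/search/show/stats/todo/seals/party/extract/ingest) all support `--format json`,
299
+ with a clean stdout that can be piped to `| jq`; failures carry a structured `error` (`code`/`category`/`retryable`) so an agent can decide whether to retry.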
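An agent-side retry loop driven by the structured `error` described above might look like the sketch below. The three field names (`code`, `category`, `retryable`) come from the text; everything else is an assumption, in particular that the error sits under a top-level `"error"` key of a JSON object, and the `run_with_retry` helper with its injected `run_command` callable is hypothetical.

```python
import json
import time
from typing import Any, Callable


def run_with_retry(
    run_command: Callable[[], str],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call run_command until it succeeds or reports a non-retryable error.

    run_command must return the command's JSON output as text.
    """
    for attempt in range(1, max_attempts + 1):
        payload = json.loads(run_command())
        error = payload.get("error") if isinstance(payload, dict) else None
        if not error:
            return payload
        if not error.get("retryable") or attempt == max_attempts:
            raise RuntimeError(
                f"command failed: {error.get('code')} ({error.get('category')})"
            )
        # Exponential backoff: 1x, 2x, 4x ... the base delay.
        sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Injecting `run_command` and `sleep` keeps the loop testable without spawning the CLI; in real use `run_command` would wrap a subprocess call that passes `--format json`.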
300
+
301
+ ## ✦ Archive directory layout
302
+
303
+ ```
304
+ archive/
305
+ ├── db.sqlite # index tables
306
+ ├── db.sqlite-wal / -shm # WAL-mode artifacts (at runtime)
307
+ ├── ingest.jsonl # global log (one JSON line per ingest)
308
+ └── documents/
309
+     └── a3f9c2b1/ # first 12 characters of the sha256
310
+         ├── source.pdf # hard link to the source PDF (falls back to a copy across disks)
311
+         ├── mineru/
312
+         │   ├── markdown.md
313
+         │   ├── layout.json # bboxes already normalized to PDF points
314
+         │   ├── structured.json
315
+         │   ├── raw_text.txt
316
+         │   ├── pipeline_meta.json
317
+         │   └── preview_images/
318
+         ├── extraction_result.json # extracted fields (generic envelope)
319
+         ├── extraction_confidence.json
320
+         └── ingest.log # per-contract stderr
321
+ ```
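The `source.pdf` entry above is described as a hard link with a copy fallback across disks. A minimal sketch of that pattern follows; the function name `link_or_copy` is hypothetical and this is not the helper in the package's `paths.py`.

```python
import os
import shutil


def link_or_copy(src: str, dst: str) -> str:
    """Hard-link src to dst, falling back to a copy when linking fails.

    Returns 'hardlink' or 'copy' so the caller can log which path was taken.
    """
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError:
        # Raised, for example, when src and dst are on different filesystems
        # or the filesystem does not support hard links.
        shutil.copy2(src, dst)
        return "copy"
```

A hard link costs no extra disk space and survives deletion of the original path, which fits an archive that must never touch the user's own file.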
322
+
323
+ ## ✦ Docker
324
+
325
+ ```bash
326
+ docker build -t contract-archive -f docker/Dockerfile .
327
+ docker run --rm -it \
328
+ -v $PWD/archive:/app/archive \
329
+ -v $PWD/input:/app/input \
330
+ -v ~/.cache/modelscope:/root/.cache/modelscope \
331
+ --env-file .env \
332
+ contract-archive uv run contract-archive ingest /app/input
333
+ ```
334
+
335
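+ Mounting the modelscope cache reuses the MinerU models already on the host. Containers on a Mac have no GPU passthrough, so running in a native venv is strongly recommended.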
336
+
337
+ ## ✦ Project layout
338
+
339
+ ```
340
+ contract-archive-cli/
341
+ ├── pyproject.toml # uv dependency management (extras: mineru)
342
+ ├── docker/Dockerfile
343
+ ├── .env.example
344
+ ├── scripts/
345
+ │   └── setup.sh
346
+ ├── contract_archive/
347
+ │   ├── cli.py # entry point main_entry + write commands ingest/extract/delete/vacuum + wiring
348
+ │   ├── cli_common.py # app instance + global callback + parameter Enums + dual consoles + path/ident resolution
349
+ │   ├── cli_query.py # read-only commands list/search/show/raw/stats/todo/seals
350
+ │   ├── cli_config.py # config show/set/unset subcommand group
351
+ │   ├── cli_party.py # party list/show/set/rm (known_parties PII baseline)
352
+ │   ├── cli_introspect.py # capabilities/describe/schema machine-discovery commands
353
+ │   ├── cli_render.py # pure rendering layer (Table / JSON dict / raw highlighting)
354
+ │   ├── schemas/ # pydantic schemas (BBox/LayoutBlock/DocumentExtraction, etc.)
355
+ │   ├── pipelines/
356
+ │   │   └── mineru_pipeline.py # MinerU subprocess invocation + coordinate normalization + markdown cleanup
357
+ │   ├── extraction/ # pure LLM extraction (rule/hybrid retired since Phase 2)
358
+ │   │   ├── document_extractor.py # generic document classification + envelope extraction
359
+ │   │   ├── contract_extractor.py # contract-specific fields (dedicated prompt)
360
+ │   │   ├── llm_extractor.py # calls through DashScope's OpenAI-compatible endpoint
361
+ │   │   ├── vision_seal.py # VL seal and signature verification on the signature page
362
+ │   │   ├── normalize.py / amount_check.py / evidence_page_fix.py / property_fee.py
363
+ │   ├── archive/
364
+ │   │   ├── db.py # SQLite connection + migrations engine
365
+ │   │   ├── repository.py # DAO + search query construction
366
+ │   │   ├── ingest.py # ingest pipeline (hash → MinerU → extract → rename → DB)
367
+ │   │   ├── party_registry.py # known_parties identity baseline
368
+ │   │   ├── paths.py # archive path conventions + hard-link helper
369
+ │   │   └── migrations/ # 001_init … 005_completeness (5 files)
370
+ │   ├── errors.py # structured error model (code/category/retryable)
371
+ │   ├── config.py # XDG config + env>file>default
372
+ │   └── utils/ # device selection / PyMuPDF PDF rendering
373
+ ├── archive/ # archive data (gitignored)
374
+ ├── input/ # where the user drops PDFs to process
375
+ └── output.legacy/ # historical output of the old pipeline (comparison data from before the refactor; safe to delete)
376
+ ```
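The coordinate normalization attributed to `mineru_pipeline.py` above maps boxes from a 0-1000 scale back to PDF points. The arithmetic can be sketched as below. The `BBox` tuple here is a stand-in rather than the package's pydantic schema, the function name is hypothetical, and the sketch assumes both coordinate systems share the same origin and axis directions.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned box: (x0, y0) is one corner, (x1, y1) the opposite one."""
    x0: float
    y0: float
    x1: float
    y1: float


def to_pdf_points(
    bbox: BBox, page_width_pt: float, page_height_pt: float, scale: float = 1000.0
) -> BBox:
    """Rescale a 0..scale normalized box to PDF points for the given page size."""
    return BBox(
        bbox.x0 / scale * page_width_pt,
        bbox.y0 / scale * page_height_pt,
        bbox.x1 / scale * page_width_pt,
        bbox.y1 / scale * page_height_pt,
    )
```

For an A4 page of 595 x 842 points, the full-page box `BBox(0, 0, 1000, 1000)` maps to `BBox(0.0, 0.0, 595.0, 842.0)`.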
377
+
378
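+ ## ✦ Design principles
379
+
380
+ - **Unified schema**: MinerU's 0-1000 normalized coordinates are all converted back to PDF points; markdown backslash escapes are cleaned before the text is fed to the extraction layer
381
+ - **Pure LLM extraction**: rule/hybrid has been retired since Phase 2 and every field is extracted by the LLM; each field carries a `value_source` (only `llm`/`missing`) plus a confidence score. Rule-based code survives only as deterministic value normalization (Chinese uppercase amounts → numbers, dates → ISO)
382
+ - **API keys stay out of the package**: they are read only from the environment, and response bodies are not logged
383
+ - **sha256 dedup**: stream-hash the file, then look it up in a UNIQUE index; a hit is skipped immediately, which avoids discovering a duplicate only after a multi-minute MinerU run
384
+ - **Transaction boundary**: run everything in a tmp directory → `os.rename` into documents/ → DB INSERT; a failure at any stage rolls back cleanly and leaves no half-finished record in the DB
385
+ - **Partial status is repairable**: when MinerU succeeds but the LLM fails, the markdown is still usable, and the `extract <id>` command re-runs only the extraction layer
386
+ - **No concurrent MinerU**: each subprocess loads GB-scale models, so concurrency only leads to OOM; workers=1 by default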
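The transaction boundary in the list above (tmp directory → `os.rename` → DB INSERT, with cleanup on failure) can be sketched as follows. This is an illustration of the pattern, not the package's `ingest.py`: the `documents` table with `sha256` and `dir` columns, the `produce` callback, and the function name are all hypothetical.

```python
import os
import shutil
import sqlite3
import tempfile
from pathlib import Path
from typing import Callable


def ingest_atomically(
    conn: sqlite3.Connection,
    archive_root: str,
    sha: str,
    produce: Callable[[Path], None],
) -> Path:
    """Build a document directory in tmp, publish it, then record it in the DB."""
    documents = Path(archive_root) / "documents"
    documents.mkdir(parents=True, exist_ok=True)
    final_dir = documents / sha[:12]

    # Stage 1: do all slow work in a tmp dir on the same filesystem,
    # so the later rename is a single atomic step.
    tmp_dir = Path(tempfile.mkdtemp(prefix="ingest-", dir=archive_root))
    try:
        produce(tmp_dir)
        os.rename(tmp_dir, final_dir)  # Stage 2: publish
    except Exception:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise

    # Stage 3: record the document; undo the publish if the INSERT fails.
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO documents (sha256, dir) VALUES (?, ?)",
                (sha, str(final_dir)),
            )
    except Exception:
        shutil.rmtree(final_dir, ignore_errors=True)
        raise
    return final_dir
```

Ordering the DB write last means a row only ever points at a directory that is already complete, which is what keeps half-finished records out of the index.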
@@ -0,0 +1,44 @@
1
+ contract_archive/__init__.py,sha256=N9b0h5EVwXMhVp10Ayyaz9oB0A9tWQWBK1ydtwxvA74,113
2
+ contract_archive/cli.py,sha256=azTCcWNvScmE7m1F4wmF0ZvI76xYLT8Z_QVeOd7Ymj4,18376
3
+ contract_archive/cli_common.py,sha256=mOAVoGJmZdv2JWORGsS7oJ4PPeaKhn86kJ-Kt7CzcVw,10417
4
+ contract_archive/cli_config.py,sha256=KKIITBQ1qlwoxgMDtIul0Rc7SJYl-9-k21fQx08dPOw,3855
5
+ contract_archive/cli_introspect.py,sha256=fJspKQN5LcIzkDP7FTlAus5j9dvKnN_aVL99V7ZAkiQ,9475
6
+ contract_archive/cli_party.py,sha256=EzzNiI6KggqYNKlaOhjzicTV3wHRmPD_OLaMmZT6MeY,7582
7
+ contract_archive/cli_query.py,sha256=d9PcmisRfeAa91HEm4rfWbvbFiReFoNZv96FNt9Hr0w,17248
8
+ contract_archive/cli_render.py,sha256=zEP7ql3sPfp8YGJ5CLcQKrtWVBSLQUyFEkRJpMX3f5c,23530
9
+ contract_archive/config.py,sha256=l1ljfo57Uym-8jQXxOblJugbqGtAEOYZYOTUp2Ym6pw,10988
10
+ contract_archive/errors.py,sha256=bGkBlNFicM7G0o8yCrYYRzH1I_pzRhZhCM2MuPOUAPM,6981
11
+ contract_archive/archive/__init__.py,sha256=1jST49wGbPKC59Jzd4D4H1GCwLjzLm4W8YAiZGklxqc,1561
12
+ contract_archive/archive/db.py,sha256=Hk1X4iOhYcvwJlA8MEvxVqDgnO0s_vExjAek-9s6Y8I,4338
13
+ contract_archive/archive/ingest.py,sha256=nYgyRFMfFzO6AxK4c1kQRY13B86oJ6hAbcXbF39vmtA,25729
14
+ contract_archive/archive/party_registry.py,sha256=PEuVuFafaQK8jT__Aq9GEYRuQR4ehwK7UpMzVQQbxX0,13432
15
+ contract_archive/archive/paths.py,sha256=UB8sET1NDs_jWR2XLlyhxYWbwXB9e7nmdi-HbuFmo-I,3637
16
+ contract_archive/archive/repository.py,sha256=3TwsAe3A9pNr4WMnZOlFg9dfFVZdqXmUtXbMX9If53Y,31323
17
+ contract_archive/archive/migrations/001_init.sql,sha256=aQYV7uFygcISZaOxv-ZGTwSOSHebcfe5mLUyb7dYz8I,3036
18
+ contract_archive/archive/migrations/002_obligations.sql,sha256=ZaKaVSWIAIxJAsM4MCJcXip8YLZYpjffDkvEfI4vGHU,1187
19
+ contract_archive/archive/migrations/003_document_types.sql,sha256=1laoz1dLg14vzyZvLGt_QPamWgGHb2cVo_HID5R-9u4,2117
20
+ contract_archive/archive/migrations/004_seals_subjects.sql,sha256=DmF1Pg_u_E8wbkx2B0fE2kl69-VqKJGcbykMZqhWUUg,2164
21
+ contract_archive/archive/migrations/005_completeness.sql,sha256=4Czsjdr3ttfjAWyUPkIaDLpg_vN2SQSeJGbe_uAjOiw,1228
22
+ contract_archive/extraction/__init__.py,sha256=oh4rj1vnH6CvZvb0_dz-P9mrTekxEq3HYMGDtWbMMlI,423
23
+ contract_archive/extraction/amount_check.py,sha256=G-BYm_WN9VjaPBXEVe-u5LsGpRebPZLMz0B27UrOYu0,4087
24
+ contract_archive/extraction/contract_extractor.py,sha256=iJhI-3m0tdDd6Ne_UQ4PnujIizN74qlD16-JphGvX1M,3859
25
+ contract_archive/extraction/document_extractor.py,sha256=dld3mCYUVLOtc0rBx1vVKRrEFMYgs8KKWytRMlP3xEY,33051
26
+ contract_archive/extraction/evidence_page_fix.py,sha256=QqOSSWuk_H5SMc3MfjgQKLLvcXp2QPPOlIdPjsAf6DM,4093
27
+ contract_archive/extraction/llm_extractor.py,sha256=Rfr6fe6K7dkuN1mlKLc908ZDt8mmzuxIJmet659evFc,9463
28
+ contract_archive/extraction/normalize.py,sha256=ep3-ZeqG6PZo3rZFAspYsk9GyYLyyi7x5LhIdDbHCjc,7255
29
+ contract_archive/extraction/property_fee.py,sha256=NSi18i7EdERkOfdJvAWwEvRfNDd7PG14wTT26SN2Xz0,3599
30
+ contract_archive/extraction/vision_seal.py,sha256=e_N-Zd5O5Kq0ZJrlyGk6cYT_nQ5AKr0DSQCqIL2CUQg,18006
31
+ contract_archive/pipelines/__init__.py,sha256=c-rjsQs11CcmQbyZgXIhREnluO8jjQJJXkACf-JlkiY,273
32
+ contract_archive/pipelines/mineru_pipeline.py,sha256=B3aZBSOCsHgZdEsnMjLB7lUsQ5G9580Q9fwiuIF1RGE,33061
33
+ contract_archive/pipelines/vl_ocr.py,sha256=Uu5EnJk5Hq8B-mgiQ6ydOWirU9kY9_KXwOdCxjeE1MM,6675
34
+ contract_archive/schemas/__init__.py,sha256=5AK6fWlHWbMdOIWI5hp93Wy03HO_zmMHbjagQ4HiSe8,1275
35
+ contract_archive/schemas/document.py,sha256=ujKsOxhZJtF19iIQY6Ti2qUKFhjPxckAlkPX2jGSAec,18737
36
+ contract_archive/utils/__init__.py,sha256=T9s-MpY3BNQ0JYTNSP6yJfbCySqyA-BRW8XlsUV40L8,561
37
+ contract_archive/utils/device.py,sha256=IhI_MMm3UjE0NPUlLCrv6n0MuRt3QoolpZLeMCCmw-w,1463
38
+ contract_archive/utils/http_env.py,sha256=3FJRmAvf6FyevpfEazveC_5VQ3qCcj1vWT6_lMW7XPs,1685
39
+ contract_archive/utils/pdf.py,sha256=wbMs3ZYO1D5XNntcR0pqkvaAI0IKa87UZNsLi6gjzaA,6503
40
+ contract_archive_cli-0.2.7.dist-info/METADATA,sha256=JXKrrOpJzIjVKlotSUP7qhgkNOKEVbPPD1-Affae2UE,18855
41
+ contract_archive_cli-0.2.7.dist-info/WHEEL,sha256=mffPy8wBnZQn2VnJUU5jE99KsxaSfiyMHV9Yt0aLVxs,87
42
+ contract_archive_cli-0.2.7.dist-info/entry_points.txt,sha256=FaCRhque-fIEWQ3yHAVEoceroCL0ZB2vPKfxyI5L0M4,69
43
+ contract_archive_cli-0.2.7.dist-info/licenses/LICENSE,sha256=scC-b_caxbJBCkr0ioJkMF_8fTKUNDEcR463CD0TWhQ,1062
44
+ contract_archive_cli-0.2.7.dist-info/RECORD,,
@@ -0,0 +1,4 @@
1
+ Wheel-Version: 1.0
2
+ Generator: hatchling 1.30.1
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ contract-archive = contract_archive.cli:main_entry
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 crhan
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.