scrivai 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
scrivai-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 chengzhang yu
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
scrivai-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,498 @@
+ Metadata-Version: 2.4
+ Name: scrivai
+ Version: 0.1.0
+ Summary: A configurable, general-purpose framework for document generation and auditing
+ Author-email: chengzhang yu <iomgaaycz@gmail.com>
+ License: MIT
+ Project-URL: Homepage, https://github.com/iomgaa/scrivai
+ Project-URL: Repository, https://github.com/iomgaa/scrivai
+ Project-URL: Issues, https://github.com/iomgaa/scrivai/issues
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Classifier: Topic :: Text Processing :: Linguistic
+ Requires-Python: >=3.11
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: litellm>=1.0
+ Requires-Dist: jinja2>=3.1
+ Requires-Dist: pyyaml>=6.0
+ Requires-Dist: python-dotenv>=1.0
+ Requires-Dist: qmd>=0.1.0
+ Provides-Extra: dev
+ Requires-Dist: ruff>=0.4; extra == "dev"
+ Requires-Dist: pytest>=8.0; extra == "dev"
+ Requires-Dist: pytest-cov>=5.0; extra == "dev"
+ Dynamic: license-file
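+
+ # Scrivai
+
+ **A configurable, general-purpose framework for document generation and auditing**
+
+ [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
+ [![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
+
+ ## Overview
+
+ Scrivai is a Python SDK for automated generation and auditing of long documents. Its core design principles:
+
+ - **Library first**: delivered as a library; users `import scrivai` and use it directly
+ - **Atomic**: every component is usable on its own, with no forced composition
+ - **Configurable**: plugs into different projects through YAML config files, with no changes to framework code
+ - **MVP**: focuses on core functionality and avoids over-engineering
+
+ ### Core capabilities
+
+ | Capability | Description |
+ |------|------|
+ | **Document generation** | Generates long documents chapter by chapter from templates, using user input plus a library of past cases, and keeps the full text coherent |
+ | **Document auditing** | Audits a document's compliance point by point against regulations and standards, and outputs a structured report |
+ | **Knowledge retrieval** | qmd-based semantic search that manages cases, rules, and templates in one place |
+ | **Document preprocessing** | OCR conversion and cleaning from PDF to Markdown (side tool) |
+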
+ ## System architecture
+
+ ```
+ ┌─────────────────────────────┐
+ │       Project (entry)       │  ← minimal config loading + component assembly
+ │   .llm / .store / .gen /    │
+ │        .ctx / .audit        │
+ └──────────────┬──────────────┘
+
+         ┌──────┴──────┐
+         ▼             ▼
+ ┌───────┴────┐   ┌────┴────────┐
+ │ Generation │   │    Audit    │
+ │   engine   │   │   engine    │
+ └───────┬────┘   └────┬────────┘
+         │             │
+ ┌───────┴───┐    ┌────┴───────┐
+ │  Context  │    │            │
+ │   tools   │    │            │
+ └───────┬───┘    └────────────┘
+
+ ┌───────┴─────────────────────┐
+ │       LLM call layer        │  ← litellm (multi-provider support)
+ │         LLM Client          │
+ └──────────────┬──────────────┘
+
+ ┌──────────────┴──────────────┐
+ │       Knowledge base        │  ← qmd semantic search
+ │       Knowledge Store       │
+ └─────────────────────────────┘
+ ```
+
+ ## Installation
+
+ ### Requirements
+
+ - Python >= 3.11
+ - Conda (recommended)
+
+ ### Installation steps
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/iomgaa/scrivai.git
+ cd scrivai
+
+ # Create a Conda environment
+ conda create -n scrivai python=3.11
+ conda activate scrivai
+
+ # Install dependencies
+ pip install -e .
+
+ # Development dependencies
+ pip install -e ".[dev]"
+ ```
+
+ ### Configure the API key
+
+ Create a `.env` file:
+
+ ```bash
+ LLM_API_KEY=your_api_key_here
+ ```
+
+ ## Quick start
+
+ ### 1. Create a project config
+
+ ```yaml
+ # my-project.yaml
+ llm:
+   model: "deepseek/deepseek-chat"
+   temperature: 0.7
+   max_tokens: 4096
+
+ knowledge:
+   db_path: "data/my-project.db"
+   namespace: "my-project"
+ ```
+
+ ### 2. Use the SDK
+
+ ```python
+ import scrivai
+
+ # Initialize the project
+ project = scrivai.Project("my-project.yaml")
+
+ # === Document generation ===
+ # Prepare the template and variables
+ template = """
+ ## Project Overview
+
+ Write this chapter based on the following information:
+
+ ### User input
+ {{ user_inputs | tojson }}
+
+ ### Relevant past cases
+ {% for case in retrieved_cases %}
+ --- Case {{ loop.index }} ---
+ {{ case.content }}
+ {% endfor %}
+
+ ### Summary of preceding text
+ {{ previous_summary }}
+
+ ### Glossary
+ {{ glossary | tojson }}
+ """
+
+ # Retrieve cases
+ cases = project.gen.retrieve_cases("substation construction plan overview", top_k=3)
+
+ # Generate the chapter
+ chapter = project.gen.generate_chapter(template, {
+     "user_inputs": {"project_name": "XX Substation", "location": "Guangdong Province"},
+     "retrieved_cases": cases,
+     "previous_summary": "",
+     "glossary": {}
+ })
+
+ # === Long-document generation (coherent across chapters) ===
+ # `chapters` (dicts with "topic" and "template") and `inputs` are supplied by the caller
+ glossary, summary = {}, ""
+ for ch in chapters:
+     cases = project.store.search(ch["topic"], top_k=3) if project.store else []
+     text = project.gen.generate_chapter(ch["template"], {
+         "user_inputs": inputs,
+         "retrieved_cases": cases,
+         "previous_summary": summary,
+         "glossary": glossary
+     })
+     summary = project.ctx.summarize(text)
+     glossary = project.ctx.extract_terms(text, glossary)
+
+ # === Document auditing ===
+ # Define the audit checkpoints
+ checkpoints = [
+     {
+         "id": "CP001",
+         "description": "Check that the Project Overview chapter is complete",
+         "severity": "error",
+         "scope": "chapter:Project Overview",
+         "prompt_template": "Check whether the chapter includes required information such as project name, location, and scale",
+         "rule_refs": [{"query": "construction plan preparation requirements"}]
+     }
+ ]
+
+ # Run the audit (`document_text` is the document to audit, supplied by the caller)
+ results = project.audit.check_many(document_text, checkpoints)
+ for r in results:
+     print(f"[{'✓' if r.passed else '✗'}] {r.checkpoint_id}: {r.finding}")
+ ```
+
+ ## Core API
+
+ ### Project
+
+ Single entry point: loads the config and assembles the components.
+
+ ```python
+ project = scrivai.Project("config.yaml")
+
+ project.llm    # LLMClient: client for LLM calls
+ project.store  # KnowledgeStore | None: knowledge base instance
+ project.gen    # GenerationEngine: chapter generation engine
+ project.ctx    # GenerationContext: context tools
+ project.audit  # AuditEngine: document audit engine
+ ```
+
+ ### GenerationEngine
+
+ Generates a single chapter (an atomic operation); orchestrating multiple chapters is left to the caller.
+
+ ```python
+ # Generate a chapter
+ text = project.gen.generate_chapter(template, variables)
+
+ # Retrieve cases (convenience method)
+ cases = project.gen.retrieve_cases(query, top_k=5, filters={"type": "case"})
+ ```
+
+ **Template variables**:
+
+ | Variable | Type | Description |
+ |------|------|------|
+ | `user_inputs` | dict | Variables supplied by the user |
+ | `retrieved_cases` | list[SearchResult] | RAG retrieval results |
+ | `previous_summary` | str | Summary of the preceding text |
+ | `glossary` | dict[str, str] | Glossary |
+
+ ### GenerationContext
+
+ Context tools that keep long documents coherent.
+
+ ```python
+ # Summarize the preceding text (compresses the context)
+ summary = project.ctx.summarize(text)
+
+ # Extract terms and merge them into the glossary
+ glossary = project.ctx.extract_terms(text, existing_glossary)
+
+ # Extract cross-references
+ refs = project.ctx.extract_references(text)
+ ```
+
+ ### AuditEngine
+
+ Four-dimensional auditing: structural compliance, reference validity, semantic compliance, and internal consistency.
+
+ ```python
+ # Audit a single checkpoint
+ result = project.audit.check_one(document, checkpoint)
+
+ # Audit a batch of checkpoints
+ results = project.audit.check_many(document, checkpoints)
+
+ # Load checkpoints from YAML
+ checkpoints = project.audit.load_checkpoints("checkpoints.yaml")
+ ```
+
+ **AuditResult**:
+
+ ```python
+ @dataclass
+ class AuditResult:
+     passed: bool            # whether the checkpoint passed
+     severity: str           # "error" | "warning" | "info"
+     checkpoint_id: str      # checkpoint identifier
+     chapter_id: str | None  # chapter identifier
+     finding: str            # audit finding
+     evidence: str           # supporting evidence
+     suggestion: str         # suggested revision
+ ```
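The fields above lend themselves to a simple roll-up after a batch audit. The following is a minimal sketch, not part of the package: `AuditResult` here is a local stand-in that mirrors the listed fields, `summarize_results` is a hypothetical helper, and the policy that only failed `"error"` checkpoints block a document is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditResult:
    """Local stand-in mirroring the fields listed in the README."""
    passed: bool
    severity: str  # "error" | "warning" | "info"
    checkpoint_id: str
    chapter_id: Optional[str]
    finding: str
    evidence: str
    suggestion: str


def summarize_results(results: list[AuditResult]) -> dict:
    """Count failed checkpoints per severity and derive an overall verdict.

    Assumed policy (not from the package): the document is blocked only
    when at least one failed checkpoint has severity "error".
    """
    failed = [r for r in results if not r.passed]
    by_severity = Counter(r.severity for r in failed)
    return {
        "total": len(results),
        "failed": len(failed),
        "by_severity": dict(by_severity),
        "blocked": by_severity.get("error", 0) > 0,
    }
```

With the real SDK, the list returned by `project.audit.check_many(...)` could presumably be passed to such a helper, provided its items expose the same attributes.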
+
+ **Checkpoint config**:
+
+ ```yaml
+ checkpoints:
+   - id: "CP001"
+     description: "Check that the project overview is complete"
+     severity: "error"
+     scope: "chapter:Project Overview"  # "full" | "chapter:xxx"
+     prompt_template: "Check whether it includes project name, location, and scale"
+     rule_refs:  # supporting clauses
+       - source: "GB50150"
+         clause_id: "3.2.1"
+       - query: "construction plan preparation requirements"  # semantic query
+ ```
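The `scope` field takes either `"full"` or `"chapter:xxx"`. How the package resolves a chapter is not stated here, so the following is only a sketch under assumptions: `select_scope` is a hypothetical helper, chapters are assumed to start at level-2 Markdown headings, and the chapter name must match the heading text exactly.

```python
import re


def select_scope(document: str, scope: str) -> str:
    """Return the part of `document` that a checkpoint's `scope` refers to.

    Assumptions (hypothetical, not the package's documented behaviour):
    chapters start at level-2 Markdown headings ("## Title") and the name
    after "chapter:" must equal the heading text.
    """
    if scope == "full":
        return document
    if not scope.startswith("chapter:"):
        raise ValueError(f"unsupported scope: {scope!r}")
    wanted = scope[len("chapter:"):].strip()

    # Split before every level-2 heading so each heading stays with its body.
    sections = re.split(r"(?m)^(?=## )", document)
    for section in sections:
        heading = section.splitlines()[0] if section else ""
        if heading.startswith("## ") and heading[3:].strip() == wanted:
            return section.strip()
    raise KeyError(f"chapter not found: {wanted!r}")
```

Raising on an unknown chapter, rather than silently auditing the full text, keeps a mistyped `scope` from producing a misleading pass.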
+
+ ### KnowledgeStore
+
+ A unified qmd-based knowledge base; `metadata["type"]` distinguishes cases from rules.
+
+ ```python
+ store = project.store  # KnowledgeStore instance (None when the knowledge base is disabled)
+
+ # Add entries
+ store.add(
+     texts=["Case content..."],
+     metadatas=[{"type": "case", "source": "doc.pdf"}]
+ )
+
+ # Bulk import
+ store.add_from_directory(
+     path="cases/",
+     pattern="*.md",
+     metadata={"type": "case"}
+ )
+
+ # Semantic search
+ results = store.search(query, top_k=5, filters={"type": "rule"})
+
+ # Count and delete
+ count = store.count(filters={"type": "case"})
+ deleted = store.delete(filters={"source": "old_doc.pdf"})
+ ```
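The `filters` argument in the calls above selects entries by metadata. The matching rules are not documented in this README, so the sketch below only illustrates one plausible reading: exact equality on every filter key (AND semantics). The `matches` helper and the toy `records` list are hypothetical and do not touch qmd.

```python
from typing import Any, Optional


def matches(metadata: dict[str, Any], filters: Optional[dict[str, Any]]) -> bool:
    """True when every key in `filters` equals the same key in `metadata`.

    Assumption: exact-match, all-keys-must-agree semantics; the real store
    may behave differently.
    """
    if not filters:
        return True
    return all(metadata.get(key) == value for key, value in filters.items())


# Toy records standing in for stored entries.
records = [
    {"text": "Substation case study", "metadata": {"type": "case", "source": "doc.pdf"}},
    {"text": "Clause 3.2.1", "metadata": {"type": "rule", "source": "GB50150"}},
]

# Keep only entries tagged as rules, as `filters={"type": "rule"}` would.
rules = [r["text"] for r in records if matches(r["metadata"], {"type": "rule"})]
```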
+
+ ### DocPipeline (side tool)
+
+ PDF to Markdown conversion and cleaning.
+
+ ```python
+ from utils.doc_pipeline import DoclingAdapter, MonkeyOCRAdapter, MarkdownCleaner, DocPipeline
+
+ # Docling (local, no service required)
+ pipeline = DocPipeline(DoclingAdapter(), MarkdownCleaner())
+ result = pipeline.run("document.pdf")
+
+ # MonkeyOCR + LLM cleaning
+ pipeline = DocPipeline(
+     MonkeyOCRAdapter("http://localhost:8080"),
+     MarkdownCleaner(llm=project.llm)
+ )
+ result = pipeline.run("document.pdf")
+
+ # Results
+ print(result.raw_md)      # raw OCR output
+ print(result.cleaned_md)  # cleaned output
+ print(result.warnings)    # validation warnings
+ ```
+
+ ## Configuration reference
+
+ ### Full configuration example
+
+ ```yaml
+ # LLM settings (required)
+ llm:
+   model: "deepseek/deepseek-chat"  # litellm model identifier
+   temperature: 0.7
+   max_tokens: 4096
+   api_base: null  # custom API endpoint (optional)
+   # api_key is read from .env (LLM_API_KEY)
+
+ # Knowledge base settings (optional; set to null to disable)
+ knowledge:
+   db_path: "data/scrivai.db"
+   namespace: "default"
+
+ # Generation engine settings (optional)
+ generation:
+   templates_dir: "templates/chapters"
+
+ # Audit engine settings (optional)
+ audit:
+   checkpoints_path: "config/checkpoints.yaml"
+ ```
+
+ ### Environment variables
+
+ | Variable | Description |
+ |------|------|
+ | `LLM_API_KEY` | API key (preferred) |
+ | `API_KEY` | API key (fallback) |
+
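The preferred/fallback order in the table can be expressed as a small lookup. This is a sketch only: `resolve_api_key` is a hypothetical helper, treating an empty value as unset is an assumption, and loading the `.env` file itself (for example with python-dotenv) is left out.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return LLM_API_KEY if set, otherwise API_KEY; raise if neither is set.

    Assumption: blank values count as unset.
    """
    env = os.environ if env is None else env
    for name in ("LLM_API_KEY", "API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("set LLM_API_KEY (or API_KEY) in the environment or .env")
```

Passing a mapping explicitly, as in `resolve_api_key({"API_KEY": "k"})`, makes the lookup easy to exercise without modifying the process environment.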
+ ## Development guide
+
+ ### Code quality
+
+ ```bash
+ # Lint
+ ruff check . --fix
+
+ # Format
+ ruff format .
+
+ # Type checking (optional)
+ mypy core/
+ ```
+
+ ### Testing
+
+ ```bash
+ # Unit tests
+ pytest tests/unit/ -v
+
+ # Integration tests (require an API key)
+ pytest tests/integration/ -v
+
+ # E2E tests
+ pytest tests/e2e/ -v
+
+ # Coverage
+ pytest tests/ --cov=core --cov-report=term-missing
+ ```
+
+ ### Test layout
+
+ ```
+ tests/
+ ├── unit/          # unit tests (use mocks)
+ ├── integration/   # integration tests (real API calls)
+ └── e2e/           # end-to-end tests
+ ```
+
+ ## Project structure
+
+ ```
+ Scrivai/
+ ├── core/  # Core modules
+ │   ├── __init__.py  # Unified exports
+ │   ├── llm.py  # LLMClient (thin litellm wrapper)
+ │   ├── project.py  # Project entry point
+ │   ├── chunkers.py  # Text chunking utilities
+ │   ├── knowledge/  # Knowledge base
+ │   │   ├── __init__.py
+ │   │   └── store.py  # KnowledgeStore (qmd wrapper)
+ │   ├── generation/  # Generation engine
+ │   │   ├── __init__.py
+ │   │   ├── engine.py  # GenerationEngine
+ │   │   └── context.py  # GenerationContext
+ │   └── audit/  # Audit engine
+ │       ├── __init__.py
+ │       └── engine.py  # AuditEngine, AuditResult
+ ├── utils/  # Utility modules
+ │   └── doc_pipeline.py  # OCR + cleaning pipeline
+ ├── templates/
+ │   └── prompts/  # Prompt templates (j2 and md kept separate)
+ │       ├── base.j2
+ │       ├── summarize.j2 / summarize.md
+ │       ├── extract_terms.j2 / extract_terms.md
+ │       ├── extract_references.j2 / extract_references.md
+ │       ├── audit.j2 / audit.md
+ │       └── clean.j2 / clean.md
+ ├── examples/  # Example configs
+ ├── tests/  # Tests
+ ├── docs/  # Documentation
+ │   ├── architecture.md  # Architecture design
+ │   └── sdk_design.md  # Detailed SDK design
+ ├── CLAUDE.md  # Development conventions
+ ├── REVIEW_GUIDE.md  # Code review guide
+ ├── pyproject.toml
+ └── README.md
+ ```
+
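+ ## Documentation
+
+ | Document | Description |
+ |------|------|
+ | [docs/architecture.md](docs/architecture.md) | System architecture in detail |
+ | [docs/sdk_design.md](docs/sdk_design.md) | Detailed SDK API design |
+ | [CLAUDE.md](CLAUDE.md) | Development conventions and SOP |
+ | [REVIEW_GUIDE.md](REVIEW_GUIDE.md) | Code review guide |
+
+ ## Design principles
+
+ ### What is not included
+
+ - **Orchestrator**: the atomic `GenerationEngine` + `AuditEngine` interfaces are sufficient; users write their own loops
+ - **Agent framework**: the workflow is deterministic, so plain code control is enough and no autonomous LLM decision-making is needed
+ - **CLI**: not part of the MVP; once the SDK is solid, a CLI will be a thin wrapper
+
+ ### Coherence mechanisms
+
+ When generating long documents (8-10 chapters), the following mechanisms keep the text coherent:
+
+ 1. **Glossary**: after each chapter is generated, its terms are extracted and merged into a global dictionary, which is injected into later chapters
+ 2. **Summary of preceding text**: after each chapter is generated, the context is compressed into a summary that later chapters carry along
+ 3. **Cross-reference tracking**: cross-chapter references are recorded, and later chapters must stay consistent with them when citing
+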
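The first two mechanisms amount to state carried from one chapter to the next. The sketch below shows that data flow with stand-ins: `summarize` and `extract_terms` are stubs replacing the LLM-backed calls, the "first definition wins" merge rule is an assumption, and cross-reference tracking is omitted.

```python
def summarize(text: str, limit: int = 80) -> str:
    """Stub for an LLM summary: keep only the first `limit` characters."""
    return text[:limit]


def extract_terms(text: str, glossary: dict[str, str]) -> dict[str, str]:
    """Stub for LLM term extraction: treat 'Term: definition' lines as terms.

    Existing entries win, so a term keeps its first definition in later
    chapters (an assumed merge rule).
    """
    merged = dict(glossary)
    for line in text.splitlines():
        term, sep, definition = line.partition(":")
        if sep and term.strip() and definition.strip():
            merged.setdefault(term.strip(), definition.strip())
    return merged


def generate_document(chapter_texts: list[str]) -> tuple[list[dict], dict[str, str]]:
    """Walk the chapters in order, recording the context each one would receive."""
    glossary: dict[str, str] = {}
    summary = ""
    trace = []
    for text in chapter_texts:
        # A real run would call the generation engine here with these inputs.
        trace.append({"previous_summary": summary, "glossary": dict(glossary)})
        summary = summarize(text)
        glossary = extract_terms(text, glossary)
    return trace, glossary
```

Each entry in `trace` shows what a chapter would be given: the first chapter starts with an empty summary and glossary, and every later chapter sees the summary of its predecessor and all terms collected so far.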
+ ## License
+
+ MIT License