disk-kg 1.0.0__tar.gz
- disk_kg-1.0.0/.gitignore +23 -0
- disk_kg-1.0.0/CLAUDE.md +2 -0
- disk_kg-1.0.0/NEO4J_GUIDE.md +10 -0
- disk_kg-1.0.0/OPT_ANALYSIS.md +96 -0
- disk_kg-1.0.0/PKG-INFO +293 -0
- disk_kg-1.0.0/PROJECT_OVERVIEW.md +99 -0
- disk_kg-1.0.0/README.md +254 -0
- disk_kg-1.0.0/TODO.md +11 -0
- disk_kg-1.0.0/config/__init__.py +3 -0
- disk_kg-1.0.0/config/llm.py +67 -0
- disk_kg-1.0.0/config.example.toml +31 -0
- disk_kg-1.0.0/disk.py +444 -0
- disk_kg-1.0.0/distiller/__init__.py +5 -0
- disk_kg-1.0.0/distiller/distiller.py +67 -0
- disk_kg-1.0.0/distiller/docling_distiller.py +156 -0
- disk_kg-1.0.0/distiller/docx_distiller.py +241 -0
- disk_kg-1.0.0/distiller/pdf_distiller.py +284 -0
- disk_kg-1.0.0/eval/__init__.py +1 -0
- disk_kg-1.0.0/eval/eval_entity_quality.py +448 -0
- disk_kg-1.0.0/eval/eval_relation_quality.py +474 -0
- disk_kg-1.0.0/extractor/__init__.py +6 -0
- disk_kg-1.0.0/extractor/entities_extractor.py +92 -0
- disk_kg-1.0.0/extractor/extractor.py +207 -0
- disk_kg-1.0.0/extractor/relations_extractor.py +91 -0
- disk_kg-1.0.0/manager/__init__.py +3 -0
- disk_kg-1.0.0/manager/kg_manager.py +22 -0
- disk_kg-1.0.0/merger/__init__.py +3 -0
- disk_kg-1.0.0/merger/merger.py +140 -0
- disk_kg-1.0.0/models/__init__.py +2 -0
- disk_kg-1.0.0/models/knowledge_graph.py +67 -0
- disk_kg-1.0.0/models/neo4j_connector.py +109 -0
- disk_kg-1.0.0/pyproject.toml +82 -0
- disk_kg-1.0.0/relate.md +133 -0
- disk_kg-1.0.0/requirements.txt +14 -0
- disk_kg-1.0.0/tests/document_test.docx +0 -0
- disk_kg-1.0.0/tests/sample.pdf +0 -0
- disk_kg-1.0.0/tests/test_bold.ipynb +160 -0
- disk_kg-1.0.0/tests/test_build_kg.ipynb +2966 -0
- disk_kg-1.0.0/tests/test_build_kg.py +19 -0
- disk_kg-1.0.0/tests/test_description.py +19 -0
- disk_kg-1.0.0/tests/test_docx_distiller.py +66 -0
- disk_kg-1.0.0/tests/test_extractor.ipynb +327 -0
- disk_kg-1.0.0/tests/test_llm_embedding.py +10 -0
- disk_kg-1.0.0/tests/test_llm_function.py +71 -0
- disk_kg-1.0.0/tests/test_merge.ipynb +184 -0
- disk_kg-1.0.0/tests/test_ocr_fix.py +28 -0
- disk_kg-1.0.0/tests/test_pdf_distiller.ipynb +836 -0
- disk_kg-1.0.0/tests/test_token_usage.py +12 -0
- disk_kg-1.0.0/tests/test_visualize_kg.py +90 -0
- disk_kg-1.0.0/tests/with_image.pdf +0 -0
- disk_kg-1.0.0/tests/李明1.pdf +0 -0
- disk_kg-1.0.0/tests/李明2.pdf +0 -0
- disk_kg-1.0.0/utils/__init__.py +29 -0
- disk_kg-1.0.0/utils/bold_text.py +19 -0
- disk_kg-1.0.0/utils/checkpoint_helper.py +23 -0
- disk_kg-1.0.0/utils/lang_detect.py +153 -0
- disk_kg-1.0.0/utils/parser.py +64 -0
- disk_kg-1.0.0/utils/prompts.py +304 -0
- disk_kg-1.0.0/utils/schemas.py +21 -0
- disk_kg-1.0.0/utils/token_tracker.py +303 -0
- disk_kg-1.0.0/uv.lock +5603 -0
disk_kg-1.0.0/.gitignore
ADDED
disk_kg-1.0.0/CLAUDE.md
ADDED
@@ -0,0 +1,96 @@
# DISK Project: Report on Violations of the Principle of Least Knowledge (Law of Demeter)

This document is based on a code audit of the core modules of the `DISK` project (`disk.py`, `extractor/`, `merger/`, `manager/`, `models/`). It identifies design flaws that violate the **Principle of Least Knowledge (Law of Demeter)** and proposes improvements.

---
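
## Definition of the Core Principle
**Principle of Least Knowledge (Law of Demeter)**: a software entity should interact with as few other entities as possible. In other words: talk only to your immediate friends, and do not talk to strangers (objects you can reach only through a friend).

---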

## 1. Reaching Through Objects into Their Internals (Train Wreck)

### Where it occurs: `extractor/extractor.py`
**Code example:**
```python
# The Extractor class directly accesses the embeddings object inside Parser
self.parser.embeddings.embed_query(entity["name"])
```
* **Analysis:** An `Extractor` instance holds a `Parser`, and the `Parser` holds `embeddings`. `Extractor` bypasses `Parser` and operates directly on one of its internal components.
* **Why it violates the principle:** `Extractor` knows a secret it should not know: how `Parser` implements vectorization.
* **Consequence:** If `Parser` later changes how it vectorizes text (for example, by calling a REST API instead of holding an `embeddings` object), `Extractor` has to change as well.
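
One possible way to remove this dependency is sketched here, under assumptions. `Parser` exposes an `embed(text)` method (the name this report proposes in its roadmap) and `Extractor` calls only that method. `FakeEmbeddings`, `embed_entity`, and the constructor signatures are hypothetical stand-ins, not the project's actual code.

```python
class FakeEmbeddings:
    """Hypothetical stand-in for an embeddings backend with embed_query()."""

    def embed_query(self, text: str) -> list[float]:
        # Toy vector (length, vowel count), for illustration only.
        vowels = sum(ch in "aeiou" for ch in text.lower())
        return [float(len(text)), float(vowels)]


class Parser:
    def __init__(self, embeddings):
        self._embeddings = embeddings  # private: callers should not reach in

    def embed(self, text: str) -> list[float]:
        """Delegate to the backend so callers never touch it directly."""
        return self._embeddings.embed_query(text)


class Extractor:
    def __init__(self, parser: Parser):
        self.parser = parser

    def embed_entity(self, entity: dict) -> list[float]:
        # Talks only to its direct collaborator, Parser.
        return self.parser.embed(entity["name"])


extractor = Extractor(Parser(FakeEmbeddings()))
vector = extractor.embed_entity({"name": "Neo4j"})
```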

### Where it occurs: `disk.py`
**Code example:**
```python
# The DISK class goes through the manager to the kg instance, then into its internal list
all_entities = self.kg_manager.kg.entities
```
* **Analysis:** A typical "train wreck" call chain: `A -> B -> C -> D`.
* **Suggested fix:** `KGManager` should provide a `get_all_entities()` method that hides the internal structure of `KnowledgeGraph`.
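
A minimal sketch of the suggested accessor follows. Only the method name `get_all_entities()` comes from the report. The simplified `KnowledgeGraph`, the `add_entity` helper, and the decision to return a copy are assumptions made for illustration.

```python
class KnowledgeGraph:
    """Simplified stand-in; the real class also stores relations."""

    def __init__(self):
        self.entities: list[str] = []


class KGManager:
    def __init__(self):
        self._kg = KnowledgeGraph()  # no longer exposed as a public .kg attribute

    def add_entity(self, entity: str) -> None:  # hypothetical helper
        self._kg.entities.append(entity)

    def get_all_entities(self) -> list[str]:
        # Return a copy so callers cannot mutate the graph's internal list.
        return list(self._kg.entities)


manager = KGManager()
manager.add_entity("DISK")
snapshot = manager.get_all_entities()
snapshot.append("not stored")  # changes only the caller's copy
```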

---

## 2. External Code Directly Manipulating Object State

### Where it occurs: `disk.py`
**Code example:**
```python
# The DISK main class directly modifies internal attributes of the extractors
self.entities_extractor.prompts = prompts
self.relations_extractor.prompts = prompts
self.extractor.prompts = prompts
```
* **Analysis:** After detecting the language, the `DISK` class directly overwrites member variables of the `Extractor` instances.
* **Why it violates the principle:** It assumes that `Extractor` always has an internal attribute named `prompts`, which breaks encapsulation.
* **Suggested fix:** Define an `update_language_config(prompts)` method on `Extractor`.
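
The sketch below shows what such a method could look like, under assumptions. Only the name `update_language_config(prompts)` comes from the report. The dictionary shape of `prompts`, the `REQUIRED_KEYS` check, and `prompt_for` are hypothetical. The point is that the extractor validates and owns its prompt state instead of having it overwritten from outside.

```python
class Extractor:
    REQUIRED_KEYS = ("extract",)  # hypothetical: prompt names this extractor needs

    def __init__(self, prompts: dict[str, str]):
        self._prompts = dict(prompts)

    def update_language_config(self, prompts: dict[str, str]) -> None:
        """Swap the prompt set after validating it; reject incomplete input."""
        missing = [key for key in self.REQUIRED_KEYS if key not in prompts]
        if missing:
            raise ValueError(f"missing prompts: {missing}")
        self._prompts = dict(prompts)

    def prompt_for(self, name: str) -> str:
        return self._prompts[name]


extractor = Extractor({"extract": "Extract entities from: {text}"})
extractor.update_language_config({"extract": "从以下文本中抽取实体:{text}"})
current_prompt = extractor.prompt_for("extract")
```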

---

## 3. Misassigned Responsibilities (God-Class Tendency)

### Where it occurs: `merger/merger.py`
**Code example:**
```python
# Merger assembles a new Entity by hand, with detailed knowledge of all its fields
merged_entity = Entity(
    name=e1.name,
    label=e1.label,
    embedding=emb1[idx1].tolist(),
    description=e1.description if hasattr(e1, 'description') else "",
)
```
* **Analysis:** `Merger` is a utility class, yet it knows the `Entity` constructor, its attribute names (`description`), and even the details of converting the vector (`tolist()`).
* **Why it violates the principle:** `Merger` knows too much. It knows not only how to merge but also how to "manufacture" entities.
* **Suggested fix:** Implement a `merge_with(other)` method on the `Entity` class.
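
As an illustration only, `merge_with(other)` could be written as follows. The field names follow the snippet quoted in this section. The dataclass form and the merge policy (this entity's values win, and gaps are filled from `other`) are assumptions, not the project's actual behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    label: str
    embedding: list[float] = field(default_factory=list)
    description: str = ""

    def merge_with(self, other: "Entity") -> "Entity":
        """Return a new Entity; this entity's fields win, gaps come from other."""
        return Entity(
            name=self.name,
            label=self.label,
            embedding=list(self.embedding or other.embedding),
            description=self.description or other.description,
        )


e1 = Entity("KG", "Concept", [0.1, 0.9])
e2 = Entity("knowledge graph", "Concept", [0.2, 0.8], "A graph of entities and relations.")
merged = e1.merge_with(e2)
```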

---

## 4. Tight Coupling Between the Domain Model and the Persistence Layer

### Where it occurs: `models/neo4j_connector.py`
**Code example:**
```python
# The connector reads every field of the entity to build a Cypher query
parameters = {
    "name": entity.name,
    "embedding": entity.embedding,
    "description": entity.description if hasattr(entity, 'description') else ""
}
```
* **Analysis:** A database connector should only be responsible for executing queries. At present it has to know every attribute name of `Entity`.
* **Consequence:** Whenever `Entity` adds or renames a field (for example, `name` becomes `display_name`), the connector breaks.
* **Suggested fix:** `Entity` should provide a `to_dict()` method, and the connector should only consume that dictionary.
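
A hedged sketch of that split is given here. `Entity.to_dict()` is the method the report proposes. `build_merge_query`, the `Entity` node label, and the use of `name` as the match key are assumptions for illustration. No Neo4j driver call is made: the function only returns the statement and parameters a connector would pass on.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Entity:
    name: str
    label: str
    embedding: list[float] = field(default_factory=list)
    description: str = ""

    def to_dict(self) -> dict:
        """Serialize every field; the entity, not the connector, owns this list."""
        return asdict(self)


def build_merge_query(entity: Entity) -> tuple[str, dict]:
    """Build a parameterized Cypher statement from the entity's own dictionary."""
    # The connector still has to know one key to match on (assumed: name).
    query = "MERGE (n:Entity {name: $props.name}) SET n += $props"
    return query, {"props": entity.to_dict()}


query, params = build_merge_query(Entity("KG", "Concept", [0.1]))
```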
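
---

## Summary and Improvement Plan

The project's current implementation style leans toward **procedural assembly**: classes behave more like "data structures" with methods attached.

### Improvement roadmap:
1. **Encapsulate and delegate**: expose an `embed(text)` interface on `Parser` and forbid external code from accessing `self.embeddings` directly.
2. **Hide internal structure**: `KGManager` should be the only exit point for `KnowledgeGraph`; forbid chained access through the `.kg` attribute.
3. **Push behavior down**: move attribute merging, object cloning, serialization, and similar logic out of `Merger` and `Connector` and back into the `Entity` and `Relation` classes.
4. **Interface contracts**: replace direct attribute assignment with method calls so that each object's internal state stays safe.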
disk_kg-1.0.0/PKG-INFO
ADDED
@@ -0,0 +1,293 @@
Metadata-Version: 2.4
Name: disk-kg
Version: 1.0.0
Summary: DISK (Domain Incremental conStruction of Knowledge graph) - A tool for distilling text from documents, extracting entities and relations, and building domain knowledge graphs
Author-email: Liu Huasheng <clipg@qq.com>, Wu Junkai <wu.junkai@qq.com>
License: MIT
Keywords: entity-extraction,knowledge-graph,llm,nlp,pdf-processing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Requires-Dist: docx2txt>=0.9
Requires-Dist: jieba
Requires-Dist: langchain-community>=0.4.1
Requires-Dist: langchain-core
Requires-Dist: langchain-openai>=1.1.7
Requires-Dist: neo4j
Requires-Dist: numpy
Requires-Dist: openai>=2.21.0
Requires-Dist: pandas
Requires-Dist: pdfplumber
Requires-Dist: pillow
Requires-Dist: pymupdf
Requires-Dist: rapidocr-onnxruntime>=1.4.4
Requires-Dist: tqdm
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: ipykernel>=6.0.0; extra == 'dev'
Requires-Dist: jupyter>=1.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# DISK
Domain Incremental conStruction of Knowledge graph.

## Overview

DISK is a comprehensive toolkit for extracting knowledge from PDF documents and building domain knowledge graphs through text distillation, entity/relation extraction, and semantic merging. The system provides a modular pipeline that transforms unstructured PDF documents into structured knowledge representations.

### Core Capabilities

- **Document Distillation**: Extract and validate text blocks, tables, and images from PDF documents
- **Entity Extraction**: Identify and extract domain entities with semantic embeddings
- **Relation Extraction**: Discover relationships between entities with contextual understanding
- **Knowledge Graph Construction**: Build and manage knowledge graphs with incremental updates
- **Semantic Merging**: Intelligently merge similar entities and relations using cosine similarity

## Architecture

### System Architecture

```mermaid
graph TB
    subgraph "Input Layer"
        PDF[PDF Document]
    end

    subgraph "Distillation Layer"
        Distiller[PDF Distiller]
        TextBlocks[Validated Text Blocks]
    end

    subgraph "Extraction Layer"
        EntExtractor[Entity Extractor]
        RelExtractor[Relation Extractor]
        UnifiedExtractor[Unified Extractor]
        Entities[(Entities + Embeddings)]
        Relations[(Relations + Embeddings)]
    end

    subgraph "Processing Layer"
        Merger[Semantic Merger]
        Manager[KG Manager]
    end

    subgraph "Output Layer"
        KG[Knowledge Graph]
        Logs[Logs & Results]
    end

    subgraph "Configuration"
        Config[LLM Config]
        Embed[Embeddings Model]
    end

    PDF --> Distiller
    Distiller --> TextBlocks

    TextBlocks --> EntExtractor
    TextBlocks --> RelExtractor
    TextBlocks --> UnifiedExtractor

    EntExtractor --> Entities
    RelExtractor --> Relations
    UnifiedExtractor --> Entities
    UnifiedExtractor --> Relations

    Entities --> Merger
    Relations --> Merger
    Merger --> Manager

    Manager --> KG
    Manager --> Logs

    Config --> EntExtractor
    Config --> RelExtractor
    Config --> UnifiedExtractor
    Embed --> EntExtractor
    Embed --> RelExtractor
    Embed --> UnifiedExtractor
    Embed --> Merger

    style PDF fill:#e1f5fe
    style KG fill:#c8e6c9
    style Distiller fill:#fff3e0
    style Merger fill:#f3e5f5
    style Manager fill:#e8f5e9
```

### Module Structure

```mermaid
graph LR
    subgraph DISK
        DiskMain[disk.py<br/>Main Entry Point]

        subgraph Core
            Distiller[distiller/<br/>PDF Distillation]
            Extractor[extractor/<br/>Information Extraction]
            MergerMod[merger/<br/>Knowledge Merging]
            ManagerMod[manager/<br/>KG Management]
        end

        subgraph Support
            Models[models/<br/>Data Models]
            Utils[utils/<br/>Utilities]
            ConfigMod[config/<br/>Configuration]
        end
    end

    DiskMain --> Distiller
    DiskMain --> Extractor
    DiskMain --> MergerMod
    DiskMain --> ManagerMod

    Extractor --> Models
    MergerMod --> Models
    ManagerMod --> Models

    Distiller --> Utils
    Extractor --> Utils
    ManagerMod --> Utils

    DiskMain --> ConfigMod

    style DiskMain fill:#1976d2,color:#fff
    style Distiller fill:#ffa726
    style Extractor fill:#42a5f5
    style MergerMod fill:#ab47bc
    style ManagerMod fill:#66bb6a
```

### Data Flow

```mermaid
sequenceDiagram
    participant User
    participant DISK
    participant Distiller
    participant Extractor
    participant Merger
    participant Manager
    participant KG

    User->>DISK: build_knowledge_graph(pdf_path)
    DISK->>Distiller: extract_text_blocks(pdf)
    Distiller-->>DISK: validated_text_blocks

    loop For each text block
        DISK->>Extractor: extract_entities(text)
        Extractor-->>DISK: entities + embeddings

        DISK->>Extractor: extract_relations(text)
        Extractor-->>DISK: relations + embeddings

        DISK->>Merger: merge(new, existing)
        Merger-->>DISK: merged entities/relations
    end

    DISK->>Manager: add_entities(entities)
    DISK->>Manager: add_relations(relations)
    Manager->>KG: update_knowledge_graph
    DISK-->>User: Knowledge Graph
```

## Modules

### Distillation Module (distiller/)

- **pdf_distiller**
  - extract **paragraphs** with intelligent validation
  - extract **tables** (to be improved)
  - extract **images** (to be improved)
  - filter out low-quality text blocks (references, incomplete sentences)
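
To make the filtering step concrete, here is a rough sketch of a text-block validator. It is not the `pdf_distiller` implementation: the function name `is_valid_block`, the 40-character minimum, the bracketed-number reference pattern, and the sentence-ending check are all assumptions chosen for illustration.

```python
import re

# Assumption: reference entries start with a bracketed number such as "[12]".
REFERENCE_PATTERN = re.compile(r"^\[\d+\]")
# Sentence-final punctuation for English and Chinese text.
SENTENCE_END = (".", "!", "?", "。", "!", "?")


def is_valid_block(text: str, min_chars: int = 40) -> bool:
    """Keep a block only if it is long enough, is not a reference entry,
    and ends like a complete sentence."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    if REFERENCE_PATTERN.match(stripped):
        return False
    return stripped.endswith(SENTENCE_END)


blocks = [
    "Knowledge graphs store entities and the relations between them.",
    "[12] J. Smith. Knowledge graphs in practice. Journal of Examples, 2020.",
    "Figure 3",
    "The merger compares embeddings and then decides whether two entities",
]
kept = [block for block in blocks if is_valid_block(block)]
```

Only the first block survives: the second looks like a reference entry, the third is too short, and the fourth does not end a sentence.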

### Extraction Module (extractor/)

- **entities_extractor**
  - extract domain entities with labels and descriptions
  - generate semantic embeddings for each entity

- **relations_extractor**
  - extract relationships between entities
  - generate semantic embeddings for each relation

- **extractor** (unified)
  - extract both entities and relations in a single pass
  - optimized for incremental processing

### Processing Modules

- **extract entities**
- **extract relationships**
- **semantic merging** (merger/)
  - merge similar entities using cosine similarity
  - update relations after entity merging
  - configurable threshold (default: 0.8)
- **construct knowledge graph** (manager/)
  - incremental knowledge graph construction
  - deduplication of entities and relations
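
The merging steps listed above can be sketched in plain Python. This is an illustrative version only, not the `merger/` implementation: the dictionary shapes, the function names, the greedy first-match strategy, and the `>=` comparison against the 0.8 threshold are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_entities(existing: list[dict], new: list[dict], threshold: float = 0.8):
    """Greedy merge: a new entity is folded into the first kept entity whose
    embedding is at least `threshold` similar; otherwise it is appended."""
    merged = list(existing)
    renamed: dict[str, str] = {}
    for candidate in new:
        match = next(
            (e for e in merged
             if cosine_similarity(e["embedding"], candidate["embedding"]) >= threshold),
            None,
        )
        if match is None:
            merged.append(candidate)
        else:
            renamed[candidate["name"]] = match["name"]
    return merged, renamed


def remap_relations(relations: list[dict], renamed: dict[str, str]) -> list[dict]:
    """Point relation endpoints at the surviving entity after a merge."""
    return [
        {**r, "start": renamed.get(r["start"], r["start"]),
         "end": renamed.get(r["end"], r["end"])}
        for r in relations
    ]


existing = [{"name": "knowledge graph", "embedding": [1.0, 0.0]}]
new = [{"name": "KG", "embedding": [0.9, 0.1]}, {"name": "PDF", "embedding": [0.0, 1.0]}]
entities, renamed = merge_entities(existing, new)
relations = remap_relations([{"start": "KG", "label": "built_from", "end": "PDF"}], renamed)
```

Here `KG` is folded into `knowledge graph` (similarity about 0.99), `PDF` is kept as a separate entity, and the relation's start endpoint is rewritten to the surviving name.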

## Config

**env**
```bash
# use uv to manage the environment
uv venv
uv sync
```

**LLM Configuration**

1. Copy the example configuration file:
   ```bash
   cp config.example.toml config.toml
   ```

2. Edit `config.toml` to set your API keys and preferences:

   ```toml
   [disk]
   llm = "openai" # Choose provider: openai, qwen, ollama, etc.

   [disk.embeddings]
   model = "text-embedding-3-small"
   api_key = "ai-..."
   api_url = "https://api.openai.com/v1"

   [model.openai]
   api_url = "https://api.openai.com/v1"
   api_key = "ai-..."
   model = "gpt-4o"

   [model.other]
   api_url = "https://api.otherprovider.com/v1"
   api_key = "sk-..."
   model = "gpt-4o"
   ```

3. Supported providers:
   - **OpenAI** (default)
   - **Qwen** (DashScope)
   - **Kimi** (Moonshot)
   - **Ollama** (Local)
   - **All other providers** that support OpenAI-compatible APIs

You can switch providers by changing the `llm` field in `[disk]` or using the runtime `switch()` function.


## Contrast

### merge
- itext2kg
```
[INFO] Wohoo! Entity was matched --- [poor deep semantic understanding in traditional ie models:Limitation] --merged--> [cosine similarity ignores deep semantic differences:Limitation]
```
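@@ -0,0 +1,99 @@
**Project Overview**

- **Name**: DISK
- **Summary**: DISK (Domain Incremental conStruction of Knowledge graph) is a toolkit that distills text from documents (currently mainly PDF), extracts entities and relations, and builds and merges domain knowledge graphs. It builds the graph incrementally by extracting text blocks from PDFs, calling an LLM for structured information extraction, embedding the entities and relations, and merging near-duplicate entities.

**Design Goals**

- **Extensible document distillation**: modules for extracting paragraphs, tables, and images (including OCR) from PDFs.
- **LLM-based information extraction**: entity and relation extraction and embedding generation through swappable LLM and embedding interfaces.
- **Semantic merging and knowledge management**: entity/relation merging strategies and a knowledge graph (KG) management interface that support incremental construction and persistence.

**Architecture Overview**

- **Input**: PDF documents
- **Stages**: distillation (distiller) -> extraction (extractor) -> merging (merger) -> management/construction (manager/models)
- **Output**: a `KnowledgeGraph` instance (containing `Entity` and `Relation` objects)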

**Main Files and Modules**

- **`disk.py`**:
  - **Role**: provides the `DISK` class as the main pipeline controller, wrapping the whole knowledge-graph construction pipeline.
  - **Key methods**: `build_knowledge_graph` (step by step: distill -> extract entities -> extract relations -> build KG) and `build_knowledge_graph_single_extractor` (a single extractor extracts entities and relations together and merges them).

- **`distiller/pdf_distiller.py` (`PDFDistiller`)**:
  - **Role**: uses `PyMuPDF` (fitz), `pdfplumber`, `PaddleOCR`, and similar tools to extract text blocks, tables, and images (with OCR) from PDFs.
  - **Features**: splits page text into "paragraphs" with basic built-in filtering (dropping references, short text, non-sentences, and so on); offers several strategies for table and image extraction, with TODO notes for optimization.

- **`extractor/`**:
  - **`extractor.py` (`Extractor`)**: calls `Parser` (LangChain-style) with a preset prompt (`utils.prompts.EXTRACT_PROMPT`) to parse text into structured JSON (`RelationsSchema` / `EntitiesSchema`), then generates embeddings for the entities and relations and returns lists of embedded objects.
  - The project also has separate `EntitiesExtractor` and `RelationsExtractor` classes (visible at the import sites) that extract entities or relations individually to support the step-by-step pipeline.

- **`merger/merger.py` (`Merger`)**:
  - **Role**: merges similar entities by the cosine similarity of their embeddings (implemented with PyTorch), then updates and merges the relation set.
  - **Threshold**: configurable `threshold` (default 0.8); the similarity matrix and merge information are logged to the `logs/` directory.

- **`manager/kg_manager.py` (`KGManager`)**:
  - **Role**: manages the `KnowledgeGraph` instance and provides incremental write interfaces such as `add_entities` and `add_relations` (with deduplication logic).

- **`models/knowledge_graph.py`**:
  - **Core classes**: `Entity` (`label`, `name`, `embedding`), `Relation` (`start_entity`, `end_entity`, `label`, `name`, `embedding`), and `KnowledgeGraph` (holds the entity and relation lists).

- **`config/llm.py`**:
  - **Role**: provides the default LLM and embeddings configuration (the example uses Qwen via `ChatTongyi` and `DashScopeEmbeddings`) and shows how to swap in Ollama or a local model.
  - **Note**: before real use, set the `api_key` in `config/config.py`.

- **`utils/parser.py`**:
  - **Role**: wraps the LangChain-style `JsonOutputParser` and `PromptTemplate` to parse LLM output into Pydantic structured objects.

- **Other utilities**:
  - `utils/checkpoint_helper.py`: saves and loads processing state (`disk.py` uses `load_checkpoint` / `save_checkpoint` to resume from a checkpoint).
  - `utils/prompts.py`, `utils/schemas.py`: define the prompt templates and Pydantic output schemas that constrain the LLM output format.
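
The checkpoint-and-resume idea mentioned for `utils/checkpoint_helper.py` can be illustrated as follows. The function names match those cited above, but the signatures, the JSON format, the state keys (`next_block`, `entities`), and the atomic-rename write are assumptions. The real helper may work differently.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state atomically: dump to a temp file, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, ensure_ascii=False)
    os.replace(tmp_name, path)


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh state when no checkpoint exists yet."""
    if not path.exists():
        return {"next_block": 0, "entities": []}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = Path(workdir) / "checkpoint.json"
    fresh = load_checkpoint(checkpoint)      # nothing saved yet
    save_checkpoint(checkpoint, {"next_block": 3, "entities": ["DISK"]})
    resumed = load_checkpoint(checkpoint)    # pick up where processing stopped
```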

**Running and Installation**

- **Environment**: Python 3.11 or later, as required by the package metadata (`Requires-Python: >=3.11`)
- **Create a virtual environment and install dependencies** (run in PowerShell):

```powershell
python -m venv .venv
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process; .\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

- **Configure the LLM API key**:
  - Edit `config/config.py` and fill in your API key (the `api_key` value).

- **Example: build a knowledge graph with `DISK`** (interactive Python or a script):

```python
from disk import DISK
from config.llm import llm, embeddings

disk = DISK(llm=llm, embeddings=embeddings)
kg = disk.build_knowledge_graph("path/to/your.pdf")
print(len(kg.entities), len(kg.relations))
```
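
**Tests and Examples**

- The `tests/` directory contains several notebook examples (`test_build_kg.ipynb`, `test_extractor.ipynb`, and others) that serve as quick-start walkthroughs.

**Logs and Results Directories**

- Extraction and merging write diagnostics to `../logs/` (similarity matrices, extracted paragraphs, merge records, and so on); extraction output is also appended to `../results/` (`extracted_relations.json` / `extracted_entities.json`).

**Extension Points and Caveats**

- OCR, table recognition, and multilingual support still have room for improvement (several TODOs are marked in the code).
- Entity merging currently relies on a cosine threshold over embeddings; more sophisticated semantic or rule-based merging may be needed to reduce false merges.
- LLM calls and prompts (`utils/prompts.py`) are key to extraction quality; further engineering of the prompts and schemas with domain examples is recommended.

**Next Steps**

- Add unit tests covering the key modules (`distiller`, `extractor`, `merger`) for regression detection.
- Provide a simple CLI or example script for batch-processing PDFs and exporting the KG (for example as JSON or GraphML).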

---

Document generated: `PROJECT_OVERVIEW.md` (in the repository root). If you would like these changes committed to git, or want concrete examples or diagrams added (such as an architecture flowchart), I can continue.