union_kb_ingest 1.0.1 → 1.0.3
- package/README.md +24 -19
- package/app_config.py +20 -0
- package/config/config.yaml +10 -5
- package/ingest.py +101 -35
- package/normalizer.py +619 -69
- package/package.json +3 -3
- package/parser.py +116 -1
- package/prompts/generate_kb_items.md +2 -2
- package/prompts/联合运维知识库建立规范.md +105 -213
- package/requirements.txt +1 -4
- package/schemas.py +15 -3
- package/splitter.py +8 -0
- package/validator.py +63 -6
- package/writer.py +4 -1
- package/drafts/.gitkeep +0 -1
- package/{approved → result}/.gitkeep +0 -0
package/README.md
CHANGED
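@@ -8,7 +8,7 @@
 2. Generate a unified intermediate Markdown format using Docling slim's offline text parsing.
 3. Split the content by chapter, scenario, rule, metric and similar granularity.
 4. Optionally call an LLM to organize the content into Markdown files that meet the project's knowledge base specification.
-5.
+5. By default, generate `status: active` Markdown files that can be handed directly to the knowledge base project.
 
 When the LLM is enabled, the tool puts `prompts/联合运维知识库建立规范.md` into the prompt as a format and quality constraint and asks the model to determine the business scenario, module, role, tags and risk level from the meaning of the source text. The heuristic generation in the code serves only as a fallback when the LLM is disabled or a call fails; no preset business keywords are used to steer the LLM output.
 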
@@ -37,13 +37,13 @@ python -m pip install -r requirements.txt
 input/
 ```
 
-
+Generate knowledge base files:
 
 ```bash
 python ingest.py draft
 ```
 
-If `
+If `result/` already contains generated files, the command first asks whether to overwrite them. Choosing `y` clears the generated files in `result/` and regenerates them; any other answer exits immediately, so that results from multiple runs do not interfere with one another.
 
 Parse to intermediate Markdown only:
 
@@ -51,50 +51,55 @@ python ingest.py draft
 python ingest.py parse
 ```
 
-
+Validate the generated results:
 
 ```bash
 python ingest.py validate
 ```
 
-
+The default directories are `input/`, `parsed/` and `result/`. Use `--input` or `--output` to override them only when you need to process other directories.
 
-
-python ingest.py promote
-```
+By default, `draft` uses `draft.max_chars` in `config/config.yaml` to limit the length of source text sent to the model per call, and additionally supplies the document outline and summaries of adjacent chunks as auxiliary context. This reduces the per-request load on a private model while preserving the relationship between neighboring sections as far as possible. `--max-chars` on the command line can still override the value temporarily.
 
-
+Each knowledge base file records category profile metadata: `category`, `category_description` and `category_keywords`. These fields are derived preferentially from the source file's top-level heading, cover-page title, chapter outline and file name, and identify the broad category of a batch or business scenario. During later RAG ingestion and retrieval, these fields should be written to the vector store metadata and used for category filtering, query routing or rerank weighting, reducing the chance that similar wording causes matches across unrelated scenarios.
+
+Generated results record `source_order` following the traversal order of the original input, and a file name prefix such as `000001-...md` keeps the directory listing in the same top-to-bottom order as the source. Page numbers are written only to `source_pages`/`source_trace` in the Front Matter and to the body section `## 5. 来源依据`; they do not appear in the body sections `## 1. 核心内容` through `## 4. 关联能力`.
 
 ## LLM configuration
 
-
+The LLM is not required by default; knowledge base files are generated from heuristic templates.
 
 To enable LLM-based organization, edit `config/config.yaml`:
 
 ```yaml
 llm:
   enabled: true
-  base_url: "https://
-  api_key: "your-api-key"
-  model: "
+  base_url: "https://open.bigmodel.cn/api/paas/v4/"
+  api_key: "your-zhipu-api-key"
+  model: "glm-4.7"
   timeout_seconds: 120
-  max_tokens:
+  max_tokens: 8192
   temperature: 0.1
+
+draft:
+  max_chars: 3600
+  context_chars: 800
+  outline_max_sections: 40
 ```
 
 Environment variables can still be used to override the configuration file:
 
 ```bash
 export KB_LLM_ENABLED=true
-export KB_LLM_BASE_URL="https://
-export KB_LLM_API_KEY="your-api-key"
-export KB_LLM_MODEL="
+export KB_LLM_BASE_URL="https://open.bigmodel.cn/api/paas/v4/"
+export KB_LLM_API_KEY="your-zhipu-api-key"
+export KB_LLM_MODEL="glm-4.7"
 ```
 
-`base_url`
+The tool calls GLM on the Chinese Zhipu open platform through the new Z.AI Python SDK. The dependency is pinned to `zai-sdk==0.2.2`, the client always uses the official Chinese-platform form `from zai import ZhipuAiClient`, and `base_url` is `https://open.bigmodel.cn/api/paas/v4/`. The tool no longer contains the old `zhipuai` SDK, the international `ZaiClient` or any OpenAI call path, and it does not import the project's `src` code.
 
 ## Relationship to the production project
 
-This tool only produces spec-compliant `*.md`
+This tool only writes spec-compliant `*.md` files to `result/`; the production knowledge base loading process handles them afterwards.
 
 It is recommended to exclude the entire `tools/kb_ingest` directory when packaging for production.
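
The README text in this diff recommends writing `category`, `category_description` and `category_keywords` into vector store metadata and filtering on them at query time. A minimal sketch of that idea is shown here, assuming an in-memory list instead of a real vector store; `Record`, `filter_by_category` and the sample data are hypothetical and not part of the package.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    """Hypothetical stand-in for one vector store entry with its metadata."""
    text: str
    category: str
    category_keywords: List[str] = field(default_factory=list)


def filter_by_category(records: List[Record], category: str) -> List[Record]:
    """Keep only records from the requested category before any similarity scoring."""
    return [r for r in records if r.category == category]


records = [
    Record("Restart the billing service after a config change.", "billing-ops", ["billing"]),
    Record("Restart the gateway after a certificate rotation.", "gateway-ops", ["gateway"]),
]

# Both texts mention "Restart"; the category filter keeps them from cross-matching.
hits = filter_by_category(records, "billing-ops")
print([r.text for r in hits])
```

A real deployment would more likely pass the category as a metadata filter to the vector store query rather than filtering in Python, but the effect on cross-scenario matches is the same.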
package/app_config.py
CHANGED
@@ -24,6 +24,13 @@ class LlmConfig:
     temperature: float = 0.1
 
 
+@dataclass(frozen=True)
+class DraftConfig:
+    max_chars: int = 3600
+    context_chars: int = 800
+    outline_max_sections: int = 40
+
+
 @lru_cache(maxsize=1)
 def get_llm_config() -> LlmConfig:
     raw = _read_config().get("llm", {})
@@ -41,6 +48,19 @@ def get_llm_config() -> LlmConfig:
     )
 
 
+@lru_cache(maxsize=1)
+def get_draft_config() -> DraftConfig:
+    raw = _read_config().get("draft", {})
+    if not isinstance(raw, dict):
+        raw = {}
+
+    return DraftConfig(
+        max_chars=_env_int("KB_DRAFT_MAX_CHARS", raw.get("max_chars"), 3600),
+        context_chars=_env_int("KB_DRAFT_CONTEXT_CHARS", raw.get("context_chars"), 800),
+        outline_max_sections=_env_int("KB_DRAFT_OUTLINE_MAX_SECTIONS", raw.get("outline_max_sections"), 40),
+    )
+
+
 def _read_config() -> Dict[str, Any]:
     path = Path(os.environ.get("KB_INGEST_CONFIG", DEFAULT_CONFIG_PATH))
     if not path.exists():
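
`get_draft_config()` passes an environment variable name, the value read from the config file, and a hard-coded default to `_env_int`, whose body is not part of this diff. The sketch below shows one plausible precedence (environment, then config, then default); `env_int` is a hypothetical stand-in, and the package's real helper may handle invalid values differently.

```python
import os
from typing import Any


def env_int(name: str, configured: Any, default: int) -> int:
    """Hypothetical helper: environment variable first, then config value, then default."""
    for candidate in (os.environ.get(name), configured):
        if candidate is None or candidate == "":
            continue
        try:
            return int(candidate)
        except (TypeError, ValueError):
            continue  # skip values that are not valid integers
    return default


# The environment variable wins over the value from config.yaml.
os.environ["KB_DRAFT_MAX_CHARS"] = "2000"
from_env = env_int("KB_DRAFT_MAX_CHARS", 3600, 3600)

# Without the variable, the configured value is used; without both, the default.
del os.environ["KB_DRAFT_MAX_CHARS"]
from_config = env_int("KB_DRAFT_MAX_CHARS", 1200, 3600)
from_default = env_int("KB_DRAFT_MAX_CHARS", None, 3600)
print(from_env, from_config, from_default)
```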
package/config/config.yaml
CHANGED
@@ -1,8 +1,13 @@
 llm:
-  enabled:
+  enabled: false
   timeout_seconds: 120
-  max_tokens:
+  max_tokens: 8192
   temperature: 0.1
-  api_key: "
-  model: "
-  base_url: "https://open.bigmodel.cn/api/paas/v4/"
+  api_key: "your-zhipu-api-key"
+  model: "GLM-4.7-Flash"
+  base_url: "https://open.bigmodel.cn/api/paas/v4/"
+
+draft:
+  max_chars: 3600
+  context_chars: 800
+  outline_max_sections: 40
package/ingest.py
CHANGED
@@ -2,16 +2,18 @@
 from __future__ import annotations
 
 import argparse
-import shutil
 import sys
 from pathlib import Path
+from typing import List
 
 CURRENT_DIR = Path(__file__).resolve().parent
 if str(CURRENT_DIR) not in sys.path:
     sys.path.insert(0, str(CURRENT_DIR))
 
+from app_config import get_draft_config
 from normalizer import normalize_block
 from parser import iter_input_files, parse_document
+from schemas import ParsedBlock
 from splitter import split_blocks
 from validator import validate_dir
 from writer import write_item
@@ -38,26 +40,36 @@ def cmd_parse(args) -> int:
 def cmd_draft(args) -> int:
     input_path = Path(args.input)
     output_dir = Path(args.output)
-    approved_dir = Path(args.approved_dir)
-    result_dir = Path(args.result_dir)
 
     existing = _list_effective_files(output_dir)
-    if existing and not _confirm_overwrite(output_dir,
+    if existing and not _confirm_overwrite(output_dir, existing):
         print("aborted. existing files were kept.")
         return 0
 
     if existing:
-        _clear_generated_files(output_dir
+        _clear_generated_files(output_dir)
 
     output_dir.mkdir(parents=True, exist_ok=True)
 
     total_items = 0
+    source_order = 0
+    draft_config = get_draft_config()
+    max_chars = args.max_chars or draft_config.max_chars
     files = iter_input_files(input_path)
     for path in files:
         parsed = parse_document(path)
-        blocks = split_blocks(parsed.blocks, max_chars=
+        blocks = split_blocks(parsed.blocks, max_chars=max_chars)
+        blocks = _attach_block_context(
+            blocks,
+            context_chars=draft_config.context_chars,
+            outline_max_sections=draft_config.outline_max_sections,
+        )
         for block in blocks:
-            for item in normalize_block(block, status=
+            for item in normalize_block(block, status=args.status):
+                source_order += 1
+                item.source_order = source_order
+                item.source_pages = sorted(set(block.pages))
+                item.source_trace = _source_trace(block)
                 write_item(item, output_dir)
                 total_items += 1
         print(f"drafted: {path} blocks={len(blocks)}")
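
The hunk above assigns a running `source_order` to every item, and the README describes file names such as `000001-...md` that keep directory order aligned with source order. The `writer.py` change is not shown in this section, so the following is only a sketch of how a zero-padded prefix achieves that; `ordered_filename` and its slug rule are hypothetical.

```python
import re


def ordered_filename(source_order: int, title: str) -> str:
    """Hypothetical naming scheme: six-digit order prefix plus a slug of the title."""
    slug = re.sub(r"[^\w]+", "-", title.strip().lower()).strip("-") or "item"
    return f"{source_order:06d}-{slug}.md"


names = [ordered_filename(n, t) for n, t in [(10, "Rollback steps"), (2, "Overview")]]

# Zero-padded prefixes sort lexicographically in the same order as source_order.
print(sorted(names))  # ['000002-overview.md', '000010-rollback-steps.md']
```

Without the padding, `10-...` would sort before `2-...` in a plain directory listing, which is the problem the fixed-width prefix avoids.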
@@ -76,15 +88,11 @@ def _list_effective_files(path: Path) -> list[Path]:
 
 def _confirm_overwrite(
     output_dir: Path,
-    approved_dir: Path,
-    result_dir: Path,
     existing: list[Path],
 ) -> bool:
     print(f"found {len(existing)} existing file(s) in {output_dir}.")
     print("Continuing will delete existing generated files under:")
     print(f"- {output_dir}")
-    print(f"- {approved_dir}")
-    print(f"- {result_dir}")
     answer = input("Overwrite and continue? [y/N]: ").strip().lower()
     return answer in {"y", "yes"}
 
@@ -96,6 +104,84 @@ def _clear_generated_files(*dirs: Path) -> None:
             print(f"deleted: {path}")
 
 
+def _attach_block_context(
+    blocks: List[ParsedBlock],
+    context_chars: int,
+    outline_max_sections: int,
+) -> List[ParsedBlock]:
+    if context_chars <= 0:
+        return blocks
+
+    outline = _document_outline(blocks, outline_max_sections)
+    output: List[ParsedBlock] = []
+    for idx, block in enumerate(blocks):
+        parts = []
+        if outline:
+            parts.append(f"文档章节目录:\n{outline}")
+        if block.category_description:
+            parts.append(
+                "知识大类说明:\n"
+                f"大类:{block.category}\n"
+                f"说明:{block.category_description}\n"
+                f"关键词:{', '.join(block.category_keywords)}"
+            )
+        if idx > 0:
+            parts.append(
+                "上一片段摘要:\n"
+                f"章节:{blocks[idx - 1].source_section}\n"
+                f"{_compact_context_text(blocks[idx - 1].content, context_chars // 2)}"
+            )
+        if idx + 1 < len(blocks):
+            parts.append(
+                "下一片段摘要:\n"
+                f"章节:{blocks[idx + 1].source_section}\n"
+                f"{_compact_context_text(blocks[idx + 1].content, context_chars // 2)}"
+            )
+        output.append(ParsedBlock(
+            source_doc=block.source_doc,
+            source_section=block.source_section,
+            content=block.content,
+            pages=block.pages,
+            order=block.order,
+            context="\n\n".join(parts),
+            category=block.category,
+            category_description=block.category_description,
+            category_keywords=block.category_keywords,
+        ))
+    return output
+
+
+def _document_outline(blocks: List[ParsedBlock], max_sections: int) -> str:
+    sections = []
+    seen = set()
+    for block in blocks:
+        section = block.source_section.strip()
+        if not section or section in seen:
+            continue
+        seen.add(section)
+        sections.append(f"- {section}")
+        if len(sections) >= max_sections:
+            remaining = len(blocks) - len(sections)
+            if remaining > 0:
+                sections.append(f"- ... 其余 {remaining} 个片段")
+            break
+    return "\n".join(sections)
+
+
+def _compact_context_text(text: str, limit: int) -> str:
+    compact = " ".join(text.split())
+    if limit <= 0 or len(compact) <= limit:
+        return compact
+    return compact[:limit].rstrip() + "..."
+
+
+def _source_trace(block: ParsedBlock) -> str:
+    parts = [f"section={block.source_section}"]
+    if block.pages:
+        parts.append(f"pages={','.join(map(str, sorted(set(block.pages))))}")
+    return "; ".join(parts)
+
+
 def cmd_validate(args) -> int:
     issues = validate_dir(Path(args.input))
     for issue in issues:
@@ -104,20 +190,6 @@ def cmd_validate(args) -> int:
     return 1 if any(i.level == "error" for i in issues) else 0
 
 
-def cmd_promote(args) -> int:
-    input_dir = Path(args.input)
-    result_dir = Path(args.result_dir)
-    result_dir.mkdir(parents=True, exist_ok=True)
-    count = 0
-    for path in sorted(input_dir.rglob("*.md")):
-        target = result_dir / path.name
-        shutil.copy2(path, target)
-        count += 1
-        print(f"promoted: {path} -> {target}")
-    print(f"done. promoted={count}")
-    return 0
-
-
 def build_parser() -> argparse.ArgumentParser:
     parser = argparse.ArgumentParser(description="Offline document-to-knowledge Markdown generator.")
     sub = parser.add_subparsers(dest="command", required=True)
@@ -129,21 +201,15 @@ def build_parser() -> argparse.ArgumentParser:
 
     draft_cmd = sub.add_parser("draft", help="Generate draft knowledge files.")
     draft_cmd.add_argument("--input", default=str(CURRENT_DIR / "input"))
-    draft_cmd.add_argument("--output", default=str(CURRENT_DIR / "
-    draft_cmd.add_argument("--
-    draft_cmd.add_argument("--
-    draft_cmd.add_argument("--max-chars", type=int, default=8000)
+    draft_cmd.add_argument("--output", default=str(CURRENT_DIR / "result"))
+    draft_cmd.add_argument("--status", default="active", choices=["draft", "active"])
+    draft_cmd.add_argument("--max-chars", type=int, default=None)
     draft_cmd.set_defaults(func=cmd_draft)
 
     validate_cmd = sub.add_parser("validate", help="Validate generated Markdown files.")
-    validate_cmd.add_argument("--input", default=str(CURRENT_DIR / "
+    validate_cmd.add_argument("--input", default=str(CURRENT_DIR / "result"))
     validate_cmd.set_defaults(func=cmd_validate)
 
-    promote_cmd = sub.add_parser("promote", help="Copy reviewed files to result.")
-    promote_cmd.add_argument("--input", default=str(CURRENT_DIR / "approved"))
-    promote_cmd.add_argument("--result-dir", default=str(CURRENT_DIR / "result"))
-    promote_cmd.set_defaults(func=cmd_promote)
-
     return parser
 
 
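
The `--max-chars` default changes from `8000` to `None` so that `cmd_draft` can fall back to the configured value via `args.max_chars or draft_config.max_chars`. The sketch below reproduces that precedence with a standalone parser; `CONFIG_MAX_CHARS` and `resolve_max_chars` are hypothetical names standing in for `draft_config.max_chars` and the inline expression.

```python
import argparse

CONFIG_MAX_CHARS = 3600  # stand-in for draft_config.max_chars

parser = argparse.ArgumentParser()
parser.add_argument("--max-chars", type=int, default=None)


def resolve_max_chars(argv):
    args = parser.parse_args(argv)
    # Same expression as cmd_draft: a missing flag (None) falls back to the config value.
    return args.max_chars or CONFIG_MAX_CHARS


print(resolve_max_chars([]))                       # 3600
print(resolve_max_chars(["--max-chars", "1200"]))  # 1200
print(resolve_max_chars(["--max-chars", "0"]))     # 3600, because 0 is falsy
```

One consequence of using `or` is that `--max-chars 0` cannot be used to request "no limit"; it is treated the same as omitting the flag.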