kc-beta 0.2.1 → 0.3.0
- package/package.json +1 -1
- package/src/agent/context.js +8 -4
- package/src/agent/engine.js +65 -9
- package/src/agent/pipelines/initializer.js +53 -8
- package/src/agent/session-state.js +1 -0
- package/src/agent/skill-loader.js +13 -1
- package/src/agent/tools/document-parse.js +104 -21
- package/src/agent/tools/document-search.js +24 -8
- package/src/agent/tools/sandbox-exec.js +16 -5
- package/src/agent/tools/workspace-file.js +47 -20
- package/src/agent/workspace.js +24 -1
- package/src/cli/components.js +8 -1
- package/src/cli/config.js +100 -6
- package/src/cli/index.js +14 -1
- package/src/cli/onboard.js +70 -1
- package/src/config.js +43 -3
- package/src/model-tiers.json +153 -0
- package/src/providers.js +63 -66
- package/template/AGENT.md +20 -0
- package/template/skills/en/meta/compliance-judgment/SKILL.md +10 -42
- package/template/skills/en/meta/document-chunking/SKILL.md +32 -0
- package/template/skills/en/meta/document-parsing/SKILL.md +11 -18
- package/template/skills/en/meta/entity-extraction/SKILL.md +13 -28
- package/template/skills/en/meta/tree-processing/SKILL.md +19 -1
- package/template/skills/en/meta-meta/auto-model-selection/SKILL.md +53 -0
- package/template/skills/en/meta-meta/pdf-review-dashboard/SKILL.md +57 -0
- package/template/skills/en/meta-meta/pdf-review-dashboard/scripts/generate_review.js +262 -0
- package/template/skills/en/meta-meta/rule-extraction/SKILL.md +24 -1
- package/template/skills/en/meta-meta/skill-authoring/SKILL.md +6 -0
- package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +4 -0
- package/template/skills/zh/meta/compliance-judgment/SKILL.md +41 -262
- package/template/skills/zh/meta/document-chunking/SKILL.md +32 -0
- package/template/skills/zh/meta/document-parsing/SKILL.md +65 -132
- package/template/skills/zh/meta/entity-extraction/SKILL.md +68 -230
- package/template/skills/zh/meta/tree-processing/SKILL.md +82 -194
- package/template/skills/zh/meta-meta/auto-model-selection/SKILL.md +51 -0
- package/template/skills/zh/meta-meta/pdf-review-dashboard/SKILL.md +55 -0
- package/template/skills/zh/meta-meta/pdf-review-dashboard/scripts/generate_review.js +262 -0
- package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +79 -164
- package/template/skills/zh/meta-meta/skill-authoring/SKILL.md +64 -185
- package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +95 -216
|
@@ -3,274 +3,112 @@ name: entity-extraction
|
|
|
3
3
|
description: Extract specific entities, values, and text segments from documents as required by verification rules. Use after tree processing has located the relevant section, when a rule needs a specific number, date, name, amount, clause, or any domain-specific entity extracted. Covers extraction method selection (regex vs LLM), schema design, postprocessing, and confidence annotation. Also use when designing the extraction step of a workflow for worker LLMs.
|
|
4
4
|
---
|
|
5
5
|
|
|
6
|
-
#
|
|
6
|
+
# Entity Extraction
|
|
7
7
|
|
|
8
|
-
|
|
8
|
+
An entity is the thing you need to check. A number, a date, a name, a clause, a percentage, a statement. The rule says what to check; extraction is how you get the value to check it against.
|
|
9
9
|
|
|
10
|
-
##
|
|
10
|
+
## Extraction Type Taxonomy
|
|
11
11
|
|
|
12
|
-
|
|
12
|
+
Different extraction scenarios call for different approaches:
|
|
13
13
|
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
-
|
|
17
|
-
-
|
|
18
|
-
- Guarantee type: mortgage + guarantee
|
|
19
|
-
- Risk disclosure paragraph (the full paragraph text)
|
|
20
|
-
- Whether a signature page exists (boolean)
|
|
14
|
+
### Single Entity from Single Section
|
|
15
|
+
The simplest case. One rule needs one value from one place.
|
|
16
|
+
- Example: "Extract the capital adequacy ratio from the Key Metrics table."
|
|
17
|
+
- Approach: Locate the section, apply regex or LLM extraction.
|
|
21
18
|
|
|
22
|
-
|
|
19
|
+
### Multiple Entities from Single Section
|
|
20
|
+
One rule needs several related values from the same place.
|
|
21
|
+
- Example: "Extract the borrower's name, loan amount, interest rate, and maturity date from the loan agreement summary."
|
|
22
|
+
- Approach: Design a single extraction call that returns all values. More efficient than multiple calls.
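
A minimal sketch of a single extraction pass that returns several related values at once. The sample text, field labels, and the `extract_fields` helper are assumptions made for illustration; they are not part of the skill or the package.

```python
import re

# Hypothetical loan summary; the labels are assumptions for illustration.
SUMMARY = (
    "Borrower: Acme Trading Ltd. Loan amount: 1,500,000 USD. "
    "Interest rate: 4.25%. Maturity date: 2027-06-30."
)

# One pattern per field, all applied to the same section in one call.
FIELD_PATTERNS = {
    "borrower_name": r"Borrower:\s*(.+?)\.\s+Loan amount",
    "loan_amount": r"Loan amount:\s*([\d,]+(?:\.\d+)?)",
    "interest_rate": r"Interest rate:\s*(\d+(?:\.\d+)?)\s*%",
    "maturity_date": r"Maturity date:\s*(\d{4}-\d{2}-\d{2})",
}


def extract_fields(section_text):
    """Return every field in one dict; missing fields map to None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, section_text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(SUMMARY)
```

Returning one dict keeps related values together, so a later judgment step can check them against each other. With a worker LLM the same idea applies: request all fields in one prompt instead of one call per field.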
|
|
23
23
|
|
|
24
|
-
|
|
24
|
+
### Single Entity from Multiple Sections
|
|
25
|
+
One value is scattered across multiple places, or needs cross-referencing.
|
|
26
|
+
- Example: "Extract the total collateral value, which may be listed in the collateral section or in Appendix A."
|
|
27
|
+
- Approach: Collect content from all relevant sections, then extract. Note which source the value came from.
|
|
25
28
|
|
|
26
|
-
|
|
29
|
+
### Entity from Full Document
|
|
30
|
+
The value could be anywhere, or the rule applies to the document as a whole.
|
|
31
|
+
- Example: "Check whether the document contains a valid signature page."
|
|
32
|
+
- Approach: For the coding agent, scan the full document. For worker LLM workflows, design a two-pass approach: first pass identifies the location, second pass extracts the value.
|
|
27
33
|
|
|
28
|
-
|
|
34
|
+
## Method Selection
|
|
29
35
|
|
|
30
|
-
|
|
36
|
+
Extraction method selection is a cost-accuracy search. The goal is finding the cheapest method that meets the accuracy threshold. Regex is the smallest, cheapest "model" — zero cost, instant, deterministic. Worker LLM is more capable but costs tokens and time. Any search strategy is valid: try the cheapest first and escalate, try the most capable first and downgrade, bisect, or jump directly to a known-good method based on past experience in AGENT.md.
|
|
31
37
|
|
|
32
|
-
|
|
33
|
-
- **Approach**: Locate the section, then extract with regex or an LLM.
|
|
34
|
-
- **This is the most common case**; optimize the workflow for this scenario first.
|
|
38
|
+
### Available Methods
|
|
35
39
|
|
|
36
|
-
|
|
40
|
+
**Regex / Python** — Cost: zero. Speed: instant. Deterministic.
|
|
41
|
+
Works well for: dates, monetary amounts, percentages, identifiers, fixed phrases, any value with a predictable format.
|
|
37
42
|
|
|
38
|
-
|
|
43
|
+
**Worker LLM** — Cost: API tokens. Speed: seconds. Semantic understanding.
|
|
44
|
+
Works well for: contextual interpretation, conditional values, semantic matching, ambiguous structures, suggestive or misleading language detection, table interpretation, anything requiring understanding rather than pattern matching.
|
|
39
45
|
|
|
40
|
-
|
|
41
|
-
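- **Approach**: Design a single extraction call that returns all values. This is more efficient than separate calls and makes it easier to keep the relationships between values consistent.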
|
|
42
|
-
- **Note**: With LLM extraction, request all fields at once in the prompt. With regex, match each field in turn against the same text segment.
|
|
46
|
+
Many real verification tasks require semantic understanding — "is this description misleading?", "does this clause adequately disclose risk?", "is this guarantor's business description consistent with their stated industry?" — regex cannot handle these. Use worker LLM without hesitation for such tasks.
|
|
43
47
|
|
|
44
|
-
###
|
|
48
|
+
### The Search
|
|
45
49
|
|
|
46
|
-
|
|
50
|
+
If a method's results fall below the accuracy threshold, try a different method or a more capable model. If regex works and meets accuracy — keep it, it's free. If regex produces results below threshold, escalate to worker LLM. If a cheap worker LLM isn't accurate enough, try a more capable tier. Record what works for each extraction type in AGENT.md for future reference.
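
One way to picture the cheapest-first variant of this search is sketched below. The function names, the labelled-sample format, and the stubbed worker LLM are all assumptions for illustration; a real workflow would call an actual model and use a larger sample set.

```python
import re


def regex_ratio(text):
    """Cheapest method: pull the first percentage out of the text."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", text)
    return float(match.group(1)) if match else None


def stub_worker_llm(text):
    """Stand-in for a worker LLM call (hypothetical, returns canned answers)."""
    canned = {"ratio is twelve and a half percent": 12.5}
    return canned.get(text, regex_ratio(text))


def pick_cheapest_method(methods, samples, threshold):
    """Try methods in cost order; return the first that meets the threshold.

    methods: list of (name, extract_fn), cheapest first.
    samples: non-empty list of (text, expected_value) pairs.
    """
    if not samples:
        raise ValueError("need at least one labelled sample")
    for name, extract in methods:
        correct = sum(1 for text, expected in samples if extract(text) == expected)
        accuracy = correct / len(samples)
        if accuracy >= threshold:
            return name, accuracy
    return None, 0.0


METHODS = [("regex", regex_ratio), ("worker_llm_stub", stub_worker_llm)]
EASY = [("CAR was 12.5%", 12.5), ("ratio: 10 %", 10.0)]
HARD = [("CAR was 12.5%", 12.5), ("ratio is twelve and a half percent", 12.5)]
```

Here `pick_cheapest_method(METHODS, EASY, 0.9)` stays on regex, while the `HARD` samples push the choice to the next tier. The selected method name is the kind of fact worth recording in AGENT.md.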
|
|
47
51
|
|
|
48
|
-
|
|
49
|
-
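- **Approach**: Collect content from all relevant sections, then extract in one pass. Record where each value came from; if the same entity shows different values in different places, that discrepancy is itself an issue requiring judgment.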
|
|
50
|
-
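- **Typical scenario in financial documents**: Net profit may appear in several places, such as "Key Financial Indicators", the income statement, and the chairman's statement.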
|
|
52
|
+
## Schema Design
|
|
51
53
|
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
The value could be anywhere, or the rule applies to the document as a whole.
|
|
55
|
-
|
|
56
|
-
- **Example**: "Check whether the document contains a valid signature page."
|
|
57
|
-
- **Approach**:
|
|
58
|
-
- When the coding agent executes: scan the full document.
|
|
59
|
-
- Worker LLM (execution model) workflows: design two passes. The first pass locates candidate positions; the second extracts precisely.
|
|
60
|
-
- **Costly**; avoid where possible. If the document tree can narrow the range (signature pages are usually at the end), prefer tree navigation.
|
|
61
|
-
|
|
62
|
-
## Method Selection
|
|
63
|
-
|
|
64
|
-
### Regex/Python (cost: zero, speed: instant)
|
|
65
|
-
|
|
66
|
-
When an entity has a predictable format, prefer regular expressions.
|
|
67
|
-
|
|
68
|
-
**Date extraction**:
|
|
69
|
-
```python
|
|
70
|
-
# Chinese date format
|
|
71
|
-
r'\d{4}年\d{1,2}月\d{1,2}日'
|
|
72
|
-
# ISO format
|
|
73
|
-
r'\d{4}[-/]\d{1,2}[-/]\d{1,2}'
|
|
74
|
-
# Mixed format
|
|
75
|
-
r'\d{4}年?\d{1,2}月?\d{1,2}日?'
|
|
76
|
-
```
|
|
77
|
-
|
|
78
|
-
**Amount extraction**:
|
|
79
|
-
```python
|
|
80
|
-
# Numeric amounts
|
|
81
|
-
r'[\d,]+\.?\d*\s*(?:元|万元|亿元|百万元)'
|
|
82
|
-
# Amounts in formal (uppercase) Chinese numerals
|
|
83
|
-
r'人民币[壹贰叁肆伍陆柒捌玖拾佰仟万亿零]+(?:元整?)'
|
|
84
|
-
# With currency marker
|
|
85
|
-
r'(?:人民币|美元|港币|EUR|USD)\s*[\d,.]+\s*(?:元|万元|亿元)?'
|
|
86
|
-
```
|
|
87
|
-
|
|
88
|
-
**Percentage extraction**:
|
|
89
|
-
```python
|
|
90
|
-
# Standard percentage
|
|
91
|
-
r'\d+\.?\d*\s*[%%]'
|
|
92
|
-
# Basis points
|
|
93
|
-
r'\d+\.?\d*\s*(?:个基点|BP|bps)'
|
|
94
|
-
# Per mille (used by some regulatory indicators)
|
|
95
|
-
r'\d+\.?\d*\s*[‰]'
|
|
96
|
-
```
|
|
97
|
-
|
|
98
|
-
**Regulatory document number extraction**:
|
|
99
|
-
```python
|
|
100
|
-
# CBIRC document number
|
|
101
|
-
r'银保监发〔\d{4}〕\d+号'
|
|
102
|
-
# CSRC document number
|
|
103
|
-
r'证监[a-z]*〔\d{4}〕\d+号'
|
|
104
|
-
# Unified social credit code
|
|
105
|
-
r'[0-9A-Z]{18}'
|
|
106
|
-
```
|
|
107
|
-
|
|
108
|
-
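For structured values, a good regular expression beats a good LLM prompt: it is faster, deterministic, and free. Build and test regexes on sample documents, and confirm coverage before deploying.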
|
|
109
|
-
|
|
110
|
-
### LLM Extraction (cost: API calls, speed: seconds)
|
|
111
|
-
|
|
112
|
-
Use an LLM when the entity requires semantic understanding:
|
|
113
|
-
|
|
114
|
-
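- **Context-dependent entities**: "the guarantor's main business" requires understanding who the guarantor is and which passage describes its business.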
|
|
115
|
-
- **Conditional values**: "the interest rate including adjustments" requires understanding what counts as an adjustment.
|
|
116
|
-
- **Semantic matching**: "adequate risk disclosure" requires judging which text constitutes risk disclosure.
|
|
117
|
-
- **Irregular tables**: The table structure is inconsistent, so regex cannot reliably extract cells.
|
|
118
|
-
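- **Implicit information**: "whether liquidity risk is mentioned" may not correspond to an explicit section heading; the discussion may be scattered across several places.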
|
|
119
|
-
|
|
120
|
-
Key points for designing LLM extraction prompts:
|
|
121
|
-
1. Include the narrowed context (from document tree processing).
|
|
122
|
-
2. State precisely what to extract; do not be vague.
|
|
123
|
-
3. Define the output format (JSON with named fields).
|
|
124
|
-
4. If the extraction target is not intuitive, provide an example.
|
|
125
|
-
5. Require explicitly that the model return null instead of guessing when the value cannot be found.
|
|
126
|
-
|
|
127
|
-
### Hybrid Approach
|
|
128
|
-
|
|
129
|
-
The most commonly used strategy in practice:
|
|
130
|
-
|
|
131
|
-
1. Extract candidate values with regex first (fast, catches the obvious matches).
|
|
132
|
-
2. If regex finds a high-confidence match, use it directly.
|
|
133
|
-
3. If regex fails or is uncertain, fall back to LLM extraction.
|
|
134
|
-
4. Where confidence requirements are high, verify the regex result with an LLM.
|
|
135
|
-
|
|
136
|
-
The hybrid approach balances cost and accuracy: 90% of extractions are done with regex (free), and the LLM handles the remaining 10% of hard cases.
|
|
137
|
-
|
|
138
|
-
## Schema Design
|
|
139
|
-
|
|
140
|
-
Define the expected output for each extraction. Keep it simple and extend only as needed (the JIT principle):
|
|
54
|
+
Define the expected output for each extraction. Keep it simple and add fields just in time (JIT), only when a rule needs them:
|
|
141
55
|
|
|
142
56
|
```json
|
|
143
57
|
{
|
|
144
58
|
"entity_name": "capital_adequacy_ratio",
|
|
145
59
|
"value": 12.5,
|
|
146
60
|
"unit": "%",
|
|
147
|
-
"raw_text": "
|
|
148
|
-
"source_location": "
|
|
149
|
-
"confidence": 0.
|
|
61
|
+
"raw_text": "资本充足率为12.5%",
|
|
62
|
+
"source_location": "Chapter 2, Table 1, Row 3",
|
|
63
|
+
"confidence": 0.95,
|
|
150
64
|
"extraction_method": "regex"
|
|
151
65
|
}
|
|
152
66
|
```
|
|
153
67
|
|
|
154
|
-
|
|
155
|
-
- **value
|
|
156
|
-
- **unit
|
|
157
|
-
- **raw_text
|
|
158
|
-
- **source_location
|
|
159
|
-
- **confidence
|
|
160
|
-
- **extraction_method
|
|
161
|
-
|
|
162
|
-
Do not over-engineer the schema. Add new fields when testing shows they are needed.
|
|
163
|
-
|
|
164
|
-
## Postprocessing
|
|
165
|
-
|
|
166
|
-
Raw extracted values usually need normalization before they can be used for judgment:
|
|
167
|
-
|
|
168
|
-
### Chinese Numeral Conversion
|
|
169
|
-
```python
|
|
170
|
-
# Formal (uppercase) Chinese numerals → digits
|
|
171
|
-
"壹仟贰佰叁拾肆万伍仟陆佰柒拾捌元" → 12345678
|
|
172
|
-
"叁拾伍亿零贰仟万元" → 3520000000
|
|
173
|
-
|
|
174
|
-
# Everyday Chinese numerals → digits
|
|
175
|
-
"一百二十万" → 1200000
|
|
176
|
-
"三千五百" → 3500
|
|
177
|
-
"十二点五" → 12.5
|
|
178
|
-
```
|
|
179
|
-
|
|
180
|
-
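This is extremely common in financial documents. Loan contracts almost always write amounts in formal numerals. Build a reliable Chinese numeral conversion function and keep it in the tool library.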
|
|
181
|
-
|
|
182
|
-
### Date Standardization
|
|
183
|
-
```python
|
|
184
|
-
"2024年3月15日" → "2024-03-15"
|
|
185
|
-
"二〇二四年三月十五日" → "2024-03-15"
|
|
186
|
-
"2024/03/15" → "2024-03-15"
|
|
187
|
-
"2024.3.15" → "2024-03-15"
|
|
188
|
-
```
|
|
189
|
-
|
|
190
|
-
### Unit Conversion
|
|
191
|
-
```python
|
|
192
|
-
# Convert to a base unit before comparing
|
|
193
|
-
"1,500万元" → 15000000 元
|
|
194
|
-
"3.5亿元" → 350000000 元
|
|
195
|
-
"150个基点" → 1.5%
|
|
196
|
-
"0.125" → 12.5% # 小数表示的百分比
|
|
197
|
-
```
|
|
198
|
-
|
|
199
|
-
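Take particular care with unit conversion. 万元 (ten thousand yuan) and 元 (yuan) differ by four orders of magnitude; one conversion error can turn a 15 million yuan loan into 1,500 yuan and completely distort the verification result. Add sanity checks to the postprocessing code.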
|
|
200
|
-
|
|
201
|
-
### Format Cleanup
|
|
202
|
-
```python
|
|
203
|
-
# Remove thousands separators
|
|
204
|
-
"12,345,678" → "12345678"
|
|
205
|
-
# Remove extra spaces and line breaks
|
|
206
|
-
"资本充足率\n为 12.5 %" → "资本充足率为12.5%"
|
|
207
|
-
# Full-width to half-width
|
|
208
|
-
"12.5%" → "12.5%"
|
|
209
|
-
```
|
|
68
|
+
The schema should capture:
|
|
69
|
+
- **value**: The extracted value, normalized.
|
|
70
|
+
- **unit**: If applicable (%, 元, days, etc.).
|
|
71
|
+
- **raw_text**: The original text fragment where the value was found. This is evidence for the judgment step.
|
|
72
|
+
- **source_location**: Where in the document the value was found.
|
|
73
|
+
- **confidence**: How sure you are (see `confidence-system`).
|
|
74
|
+
- **extraction_method**: What extracted it (regex, LLM-TIER2, etc.).
|
|
210
75
|
|
|
211
|
-
|
|
76
|
+
Do not over-engineer the schema. Add fields as needed during testing.
|
|
212
77
|
|
|
213
|
-
##
|
|
78
|
+
## Postprocessing
|
|
214
79
|
|
|
215
|
-
|
|
80
|
+
Raw extracted values often need normalization:
|
|
216
81
|
|
|
217
|
-
|
|
82
|
+
- **Chinese numerals → digits**: 一百二十万 → 1200000
|
|
83
|
+
- **Date standardization**: 2024年3月15日 → 2024-03-15
|
|
84
|
+
- **Unit conversion**: 万元 → multiply by 10000 if comparing to a threshold in 元.
|
|
85
|
+
- **Whitespace and noise removal**: Strip extra spaces, line breaks, formatting artifacts.
|
|
86
|
+
- **Percentage normalization**: 0.125 → 12.5% or vice versa, depending on what the rule expects.
|
|
218
87
|
|
|
219
|
-
|
|
220
|
-
|---------|-----------|------|
|
|
221
|
-
| Regex match + format validation | 0.90-0.95 | Format is right, so the value is very likely right |
|
|
222
|
-
| LLM extraction, high certainty | 0.80-0.85 | The model clearly found the value |
|
|
223
|
-
| LLM extraction, ambiguous | 0.60-0.75 | The model is unsure or there are multiple candidates |
|
|
224
|
-
| Fallback/inferred value | 0.40-0.60 | Not directly extracted; involves some guessing |
|
|
225
|
-
| No value found | 0.0 | Flag as MISSING |
|
|
88
|
+
Build postprocessing as Python functions in the rule skill's `scripts/` directory. They are deterministic and reusable.
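
The normalizations listed above could look roughly like the following. This is a sketch under stated assumptions: the function names are hypothetical, `chinese_to_int` only handles simple well-formed everyday numerals (no formal 壹/贰 forms, no colloquial shorthand, no 万亿), and amounts are returned as floats.

```python
import re

CN_DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
CN_SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
CN_BIG_UNITS = {"万": 10_000, "亿": 100_000_000}
AMOUNT_UNITS = {"元": 1, "万元": 10_000, "亿元": 100_000_000}


def chinese_to_int(text):
    """Convert a simple Chinese numeral such as 一百二十万 to an int."""
    total, section, digit = 0, 0, 0
    for ch in text:
        if ch in CN_DIGITS:
            digit = CN_DIGITS[ch]
        elif ch in CN_SMALL_UNITS:
            # A bare 十 means 10, hence the "or 1".
            section += (digit or 1) * CN_SMALL_UNITS[ch]
            digit = 0
        elif ch in CN_BIG_UNITS:
            total += (section + digit) * CN_BIG_UNITS[ch]
            section, digit = 0, 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def normalize_date(raw):
    """Map 2024年3月15日, 2024/03/15, or 2024.3.15 to 2024-03-15."""
    m = re.search(r"(\d{4})\s*[年/.\-]\s*(\d{1,2})\s*[月/.\-]\s*(\d{1,2})", raw)
    if not m:
        return None
    year, month, day = (int(g) for g in m.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"


def amount_to_yuan(raw):
    """Convert an amount such as 1,500万元 to a value in 元."""
    m = re.search(r"([\d,]+(?:\.\d+)?)\s*(亿元|万元|元)", raw)
    if not m:
        return None
    number = float(m.group(1).replace(",", ""))
    return number * AMOUNT_UNITS[m.group(2)]
```

Each function returns `None` (or raises) instead of guessing when the input does not match, which keeps a failed normalization visible to the judgment step.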
|
|
226
89
|
|
|
227
|
-
|
|
90
|
+
## Confidence Annotation
|
|
228
91
|
|
|
229
|
-
|
|
92
|
+
Every extraction should carry a confidence estimate:
|
|
230
93
|
|
|
231
|
-
-
|
|
232
|
-
-
|
|
233
|
-
-
|
|
234
|
-
-
|
|
94
|
+
- **Regex match, validated format**: 0.90-0.95
|
|
95
|
+
- **LLM extraction, high certainty**: 0.80-0.85
|
|
96
|
+
- **LLM extraction, some ambiguity**: 0.60-0.75
|
|
97
|
+
- **Fallback or inferred value**: 0.40-0.60
|
|
98
|
+
- **No value found**: 0.0 (flag as MISSING)
|
|
235
99
|
|
|
236
|
-
|
|
100
|
+
These are starting points. Calibrate based on actual accuracy (see `confidence-system`).
|
|
237
101
|
|
|
238
|
-
|
|
102
|
+
## Prompt Design: Ask For What You Want
|
|
239
103
|
|
|
240
|
-
|
|
241
|
-
2. **Space available for document content** = model context window - N - response reserve.
|
|
242
|
-
3. If the section content exceeds the available space, narrow further via the document tree.
|
|
243
|
-
4. Always leave enough room for the model's response (at least 1K-2K tokens).
|
|
244
|
-
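5. **Test with the actual model**: the coding agent's token count may differ from the worker LLM's tokenizer. For Chinese text, token counts can differ by up to 30% between tokenizers.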
|
|
104
|
+
Design prompts for what you want, not against what you don't want. "Don't include explanations" in a prompt is less reliable than stripping non-JSON text from the output in postprocessing. If you need to tell the LLM not to do something, use output filtering instead of prompt negation.
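
As an illustration of output filtering, the sketch below pulls the first decodable JSON object out of a chatty model reply. The helper name `extract_json_object` is an assumption; it relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json


def extract_json_object(llm_output):
    """Return the first decodable JSON object in the text, or None."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(llm_output):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(llm_output, start)
        except json.JSONDecodeError:
            continue
        return obj
    return None


reply = 'Sure! Here is the result: {"value": 12.5, "unit": "%"} Let me know.'
parsed = extract_json_object(reply)
```

The prompt can then simply ask for the JSON fields it wants, and any surrounding explanation is dropped in postprocessing.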
|
|
245
105
|
|
|
246
|
-
|
|
106
|
+
## Fitting Worker LLM Context
|
|
247
107
|
|
|
248
|
-
|
|
249
|
-
You are an entity extraction assistant for financial documents.
|
|
250
|
-
|
|
251
|
-
Task: Extract the specified entity from the document content below.
|
|
252
|
-
|
|
253
|
-
Entity to extract: {entity_description}
|
|
254
|
-
|
|
255
|
-
Document location: {document_path}
|
|
256
|
-
|
|
257
|
-
Document content:
|
|
258
|
-
---
|
|
259
|
-
{section_content}
|
|
260
|
-
---
|
|
261
|
-
|
|
262
|
-
Output format (JSON):
|
|
263
|
-
{
|
|
264
|
-
"value": <提取的值>,
|
|
265
|
-
"unit": "<单位>",
|
|
266
|
-
"raw_text": "<原文中包含该值的完整句子>",
|
|
267
|
-
"found": true/false
|
|
268
|
-
}
|
|
269
|
-
|
|
270
|
-
Notes:
|
|
271
|
-
- If the entity cannot be found, set found to false and value to null.
|
|
272
|
-
- Do not guess or infer; extract only information that is explicitly present in the document.
|
|
273
|
-
- raw_text must be the original text from the document; do not paraphrase.
|
|
274
|
-
```
|
|
108
|
+
When designing extraction for worker LLM workflows:
|
|
275
109
|
|
|
276
|
-
|
|
110
|
+
1. Calculate the prompt size: system prompt + instructions + examples + output format = N tokens.
|
|
111
|
+
2. Available context for document content = model's context window - N - tokens reserved for the response.
|
|
112
|
+
3. If the section exceeds available context, narrow further via tree processing.
|
|
113
|
+
4. Always leave room for the model's response.
|
|
114
|
+
5. Test with the actual model to verify the context fits — token counts from the coding agent may differ from the worker LLM's tokenizer.
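
The budget arithmetic in these steps can be sketched as below. The function name and the explicit response reserve are assumptions; the example figures (32,768-token window; 2,000 + 500 + 300 + 500 prompt tokens; 2,000 reserved for the reply) mirror the budget shown in the removed version of the tree-processing skill.

```python
def available_content_tokens(context_window, prompt_tokens, response_reserve):
    """Tokens left for document content after prompt and reply are budgeted."""
    budget = context_window - prompt_tokens - response_reserve
    if budget <= 0:
        raise ValueError("prompt and response reserve exceed the context window")
    return budget


def fits(section_tokens, budget):
    """True when a section can be sent as-is; otherwise narrow via the tree."""
    return section_tokens <= budget


# system prompt + instructions + output format + examples = N
PROMPT_TOKENS = 2000 + 500 + 300 + 500
BUDGET = available_content_tokens(32768, PROMPT_TOKENS, 2000)
```

Token counts should still come from the worker LLM's own tokenizer where possible, since estimates made elsewhere may be off.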
|
|
@@ -1,233 +1,121 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: tree-processing
|
|
3
|
-
description:
|
|
3
|
+
description: >
|
|
4
|
+
Design production-grade document chunking mechanisms for verification workflows. Use when
|
|
5
|
+
building the chunking step of a workflow that will run repeatedly on many documents.
|
|
6
|
+
The approach: observe sample documents, find structural patterns, write a chunking script
|
|
7
|
+
in code, that script runs in production. Also use for navigating large documents via
|
|
8
|
+
hierarchical structure when a rule targets a specific section.
|
|
9
|
+
For quick, cheap batch chunking during exploration, use document-chunking instead.
|
|
4
10
|
---
|
|
5
11
|
|
|
6
|
-
#
|
|
12
|
+
# Tree Processing
|
|
7
13
|
|
|
8
|
-
|
|
14
|
+
Most verification rules do not need the entire document. They need a specific section, a specific table, a specific disclosure. The tree is your map for navigating large documents efficiently.
|
|
9
15
|
|
|
10
|
-
##
|
|
16
|
+
## Production Chunking Methodology
|
|
11
17
|
|
|
12
|
-
|
|
18
|
+
For verification workflows that process many documents, the chunking mechanism must be precise, consistent, and fast. The approach:
|
|
13
19
|
|
|
14
|
-
1.
|
|
20
|
+
1. **Observe**: Read 3-5 sample documents. Note their structure — headers, numbering, section patterns.
|
|
21
|
+
2. **Find patterns**: Identify what's consistent (header format, numbering convention, TOC structure).
|
|
22
|
+
3. **Write code**: Design a chunking script (regex-based splitter, header detector, TOC parser) that captures the pattern.
|
|
23
|
+
4. **Test**: Run the script on samples. Verify it produces correct, consistent chunks.
|
|
24
|
+
5. **Deploy**: The script runs in production workflows. It's deterministic, free, and fast.
|
|
15
25
|
|
|
16
|
-
|
|
26
|
+
This is different from `document-chunking` (quick, cheap splits for exploration). Production chunking is a one-time design effort that pays off across all documents of the same type.
|
|
17
27
|
|
|
18
|
-
|
|
28
|
+
## Why Trees
|
|
19
29
|
|
|
20
|
-
|
|
30
|
+
Two reasons:
|
|
21
31
|
|
|
22
|
-
|
|
32
|
+
1. **Rules have scope.** "The risk disclosure in Chapter 5 must contain..." — you need to find Chapter 5, not read 1000 pages.
|
|
33
|
+
2. **Worker LLMs have limits.** A 16K-32K context window cannot hold a 1000-page document. You must narrow to the relevant section.
|
|
23
34
|
|
|
24
|
-
|
|
35
|
+
The tree structure solves both: it tells you WHERE things are, and lets you extract JUST what you need.
|
|
25
36
|
|
|
26
|
-
|
|
27
|
-
- Common in Chinese regulatory documents: `第一章`, `第二章`, ... `第一节`, `第二节`, ...
|
|
28
|
-
- Common in Chinese financial reports: `一、`, `二、`, ... `(一)`, `(二)`, ... `1.`, `2.`, ...
|
|
29
|
-
- English documents: `Chapter X`, `Part X`, `Section X.X`
|
|
30
|
-
- Numbering systems: `1.1.2`, `Article 3`, `(a)(i)`
|
|
31
|
-
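- **Visual markers**: Bold text, larger font size, page breaks, horizontal rule separators. In parsed text these may show up as blank lines or special formatting.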
|
|
32
|
-
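- **Table of contents (TOC)**: Most formal financial documents have one. The TOC is the tree structure the document itself provides, including headers and page numbers.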
|
|
37
|
+
## Building the Tree
|
|
33
38
|
|
|
34
|
-
|
|
39
|
+
### Step 1: Discover the Structure
|
|
35
40
|
|
|
36
|
-
|
|
41
|
+
Before building a tree parser, explore several sample documents to find structural patterns. Look for:
|
|
37
42
|
|
|
38
|
-
|
|
43
|
+
- **Header conventions**: Do chapters start with "Chapter X"? "第X章"? "Part X"? A Roman numeral?
|
|
44
|
+
- **Numbering systems**: "1.1.2", "Article 3", "(a)(i)", hierarchical numbering?
|
|
45
|
+
- **Visual markers**: Bold text, larger font, horizontal rules, page breaks before chapters?
|
|
46
|
+
- **Table of contents**: Most formal documents have one. It is the document's own tree.
|
|
39
47
|
|
|
40
|
-
|
|
48
|
+
Spend time here. The patterns you find determine whether the tree builder is a simple regex or a complex parser.
|
|
41
49
|
|
|
42
|
-
|
|
43
|
-
# Chapter headers: 第一章, 第二章, ...
|
|
44
|
-
r'^第[一二三四五六七八九十百千]+章\s'
|
|
50
|
+
### Step 2: Choose the Parser
|
|
45
51
|
|
|
46
|
-
|
|
47
|
-
|
|
52
|
+
**If patterns are consistent** (they usually are in regulated documents):
|
|
53
|
+
- Write a regex-based splitter. For example:
|
|
54
|
+
- `^第[一二三四五六七八九十百千]+章` for Chinese chapter headers
|
|
55
|
+
- `^Chapter \d+` for English
|
|
56
|
+
- `^\d+\.\d+(\.\d+)*\s` for numbered sections
|
|
57
|
+
- This is fast, deterministic, and reliable. Prefer this when it works.
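
A regex-based splitter built from the three patterns above might be sketched like this. The chunk format (a dict with `header` and `lines`) and the function name are assumptions for illustration only.

```python
import re

# The three header patterns listed above.
HEADER_PATTERNS = [
    re.compile(r"^第[一二三四五六七八九十百千]+章"),
    re.compile(r"^Chapter \d+"),
    re.compile(r"^\d+\.\d+(\.\d+)*\s"),
]


def split_by_headers(text):
    """Split text into chunks, starting a new chunk at each header line."""
    chunks = []
    current = {"header": None, "lines": []}
    for line in text.splitlines():
        if any(p.match(line) for p in HEADER_PATTERNS):
            # Close the previous chunk unless it is an empty preamble.
            if current["header"] is not None or current["lines"]:
                chunks.append(current)
            current = {"header": line.strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["header"] is not None or current["lines"]:
        chunks.append(current)
    return chunks


SAMPLE = "Preface\nChapter 1 Definitions\nterm A\n1.1 Scope\nbody\n第二章 财务数据\n内容"
CHUNKS = split_by_headers(SAMPLE)
```

Text before the first header is kept as a chunk with `header` set to `None`, so nothing is silently dropped.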
|
|
48
58
|
|
|
49
|
-
|
|
50
|
-
|
|
59
|
+
**If patterns are inconsistent or absent**:
|
|
60
|
+
- Use the LLM-guided wedge-driving approach (see `rule-extraction/references/chunking-strategies.md` for the full algorithm: rolling context window, K-token quoting, Levenshtein fuzzy matching).
|
|
61
|
+
- This is slower and costs LLM calls, but handles unstructured documents. The rolling window means even very large unstructured leaf nodes can be chunked incrementally.
|
|
51
62
|
|
|
52
|
-
|
|
53
|
-
|
|
63
|
+
**If the document has a table of contents**:
|
|
64
|
+
- Parse the TOC first. It gives you the tree structure and page numbers for free.
|
|
65
|
+
- Then use the TOC-derived structure to split the document body.
|
|
54
66
|
|
|
55
|
-
|
|
56
|
-
r'^\d+\.\d+(\.\d+)*\s'
|
|
57
|
-
```
|
|
58
|
-
|
|
59
|
-
Regex splitters are fast, deterministic, and highly reliable. Use regex whenever it works.
|
|
60
|
-
|
|
61
|
-
**If patterns are inconsistent or absent**:
|
|
62
|
-
|
|
63
|
-
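Use LLM-assisted wedge-driving segmentation (see `rule-extraction/references/chunking-strategies.md`, which details the full algorithm: rolling context window, K-token quoting, Levenshtein fuzzy matching). This approach is slow and consumes API calls, but it handles documents with unclear structure. The rolling window mechanism means even very large unstructured leaf nodes can be chunked incrementally.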
|
|
64
|
-
|
|
65
|
-
**If the document has a table of contents**:
|
|
66
|
-
|
|
67
|
-
Parse the TOC first. It directly gives you the tree structure and page numbers, saving you the work of building them yourself:
|
|
68
|
-
- Extract header text, hierarchy, and page numbers from the TOC.
|
|
69
|
-
- Use the TOC-derived structure to split the document body.
|
|
70
|
-
- Verify that the TOC matches the body (occasionally the TOC does not match the actual content).
|
|
71
|
-
|
|
72
|
-
### Step 3: Build the Tree
|
|
73
|
-
|
|
74
|
-
The document tree is a simple nested structure:
|
|
75
|
-
|
|
76
|
-
```
|
|
77
|
-
年度报告
|
|
78
|
-
├── 第一章 公司概况(第1-20页)
|
|
79
|
-
│ ├── 第一节 基本信息
|
|
80
|
-
│ └── 第二节 主要业务
|
|
81
|
-
├── 第二章 财务数据(第21-80页)
|
|
82
|
-
│ ├── 第一节 主要财务指标
|
|
83
|
-
│ │ ├── 一、资本充足率
|
|
84
|
-
│ │ └── 二、资产质量指标
|
|
85
|
-
│ ├── 第二节 资产负债表
|
|
86
|
-
│ └── 第三节 利润表
|
|
87
|
-
├── 第三章 风险管理(第81-150页)
|
|
88
|
-
│ ├── 第一节 信用风险
|
|
89
|
-
│ ├── 第二节 市场风险
|
|
90
|
-
│ └── 第三节 操作风险
|
|
91
|
-
└── 附录
|
|
92
|
-
├── 附录一 关联交易明细
|
|
93
|
-
└── 附录二 监管指标计算说明
|
|
94
|
-
```
|
|
95
|
-
|
|
96
|
-
Each node stores:
|
|
97
|
-
- **Header text**: The node's header.
|
|
98
|
-
- **Level**: Depth in the tree (chapter = 1, section = 2, item = 3, ...).
|
|
99
|
-
- **Position**: Start and end positions in the parsed text (character offsets or line numbers).
|
|
100
|
-
- **Content size**: The node content's size in tokens or characters. This determines whether it can be fed directly to the worker LLM.
|
|
101
|
-
|
|
102
|
-
### Step 4: Use the Tree
|
|
103
|
-
|
|
104
|
-
When a rule says "check the risk disclosure chapter":
|
|
105
|
-
|
|
106
|
-
1. **Search the tree nodes**: Match the rule's scope description against node headers.
|
|
107
|
-
- Exact match: "第三章" → find the node titled "第三章 风险管理" (Chapter 3, Risk Management).
|
|
108
|
-
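- Semantic match: "content related to risk disclosure" → requires fuzzy matching or LLM classification to find nodes whose content relates to risk disclosure.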
|
|
109
|
-
- Keyword match: Run a keyword search over headers, e.g. "风险" (risk) + "披露" (disclosure).
|
|
110
|
-
|
|
111
|
-
2. **Extract the content**: Get the full content of that node (and its children).
|
|
112
|
-
|
|
113
|
-
3. **Check the size**: Does the content fit in the worker LLM's context window?
|
|
114
|
-
- Fits → use it directly.
|
|
115
|
-
- Does not fit → descend to child nodes and find the specific subsection the rule needs.
|
|
116
|
-
|
|
117
|
-
## Onion Peeling
|
|
118
|
-
|
|
119
|
-
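This is the core strategy for large documents: like peeling an onion, narrow the scope one layer at a time, and split to the next layer only when the size limit is exceeded.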
|
|
120
|
-
|
|
121
|
-
### Procedure
|
|
122
|
-
|
|
123
|
-
1. **Outermost layer**: The whole document. If it fits in the context window (rare), use it directly.
|
|
124
|
-
2. **First peel**: Split by top-level headers (chapters). Most chapters fall within the 16K-32K range.
|
|
125
|
-
3. **Second peel**: If a chapter is still too large, split by second-level headers (sections).
|
|
126
|
-
4. **Keep peeling**: Continue if necessary, but two or three layers are usually enough.
|
|
127
|
-
|
|
128
|
-
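Key principle: **do not pre-shred the document**. Split further only when a node is too large. Over-splitting loses context; a paragraph detached from its chapter may be misread.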
|
|
67
|
+
### Step 3: Build the Tree
|
|
129
68
|
|
|
130
|
-
|
|
69
|
+
The tree is a simple nested structure:
|
|
131
70
|
|
|
132
|
-
No matter how deep you descend, always prepend the parent node chain to the extracted content:
|
|
133
|
-
|
|
134
|
-
```
|
|
135
|
-
[Document title] > 第二章 财务数据 > 第一节 主要财务指标 > 一、资本充足率
|
|
136
|
-
|
|
137
|
-
(section content follows)
|
|
138
|
-
```
|
|
139
|
-
|
|
140
|
-
This tells the worker LLM where the current content sits in the document, avoiding out-of-context misjudgments.
|
|
141
|
-
|
|
142
|
-
## Chinese Document Structure Patterns
|
|
143
|
-
|
|
144
|
-
Different types of financial documents follow different structural conventions:
|
|
145
|
-
|
|
146
|
-
### Regulatory Documents (issued by the CBIRC and CSRC)
|
|
147
|
-
```
|
|
148
|
-
第一章 总则
|
|
149
|
-
第一条、第二条、第三条……
|
|
150
|
-
第二章 ……
|
|
151
|
-
第X条……
|
|
152
|
-
附则
|
|
153
|
-
```
|
|
154
|
-
Characteristic: Article numbering is continuous and does not reset across chapters.
|
|
155
|
-
|
|
156
|
-
### Annual Reports of Listed Companies
|
|
157
71
|
```
|
|
158
|
-
|
|
159
|
-
|
|
160
|
-
|
|
161
|
-
|
|
162
|
-
|
|
72
|
+
Document
|
|
73
|
+
├── Part I: General Provisions
|
|
74
|
+
│ ├── Chapter 1: Definitions (pages 1-15)
|
|
75
|
+
│ └── Chapter 2: Scope (pages 16-22)
|
|
76
|
+
├── Part II: Capital Requirements
|
|
77
|
+
│ ├── Chapter 3: Minimum Capital (pages 23-45)
|
|
78
|
+
│ │ ├── Section 3.1: Tier 1 Capital
|
|
79
|
+
│ │ └── Section 3.2: Tier 2 Capital
|
|
80
|
+
│ └── Chapter 4: Risk Weighting (pages 46-78)
|
|
81
|
+
└── Part III: Disclosure
|
|
82
|
+
└── Chapter 5: Risk Disclosure (pages 79-120)
|
|
163
83
|
```
|
|
164
|
-
Characteristic: "节" (section) is the top level, with "一、二、三" numbering inside.
|
|
165
84
|
|
|
166
|
-
|
|
167
|
-
```
|
|
168
|
-
一、核心概况
|
|
169
|
-
二、资本充足率
|
|
170
|
-
(一)资本构成
|
|
171
|
-
(二)风险加权资产
|
|
172
|
-
三、杠杆率
|
|
173
|
-
……
|
|
174
|
-
```
|
|
175
|
-
Characteristic: Numbered with Chinese numerals; the hierarchy is shallow.
|
|
176
|
-
|
|
177
|
-
### Loan Contracts
|
|
178
|
-
```
|
|
179
|
-
第一条 借款金额
|
|
180
|
-
第二条 借款期限
|
|
181
|
-
第三条 借款利率
|
|
182
|
-
……
|
|
183
|
-
第X条 违约责任
|
|
184
|
-
```
|
|
185
|
-
Characteristic: Flat structure; each article stands alone.
|
|
186
|
-
|
|
187
|
-
Choose the regex patterns that match the document type.
|
|
85
|
+
Each node stores: the header text, the level, the start/end positions in the document, and the content size (in tokens or characters).
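
A node with these fields, and a way to nest a flat header list into a tree, could be sketched as follows. The `TreeNode` class, the `build_tree` helper, and the `(header, level, start)` input format are assumptions; sizes here are measured in characters.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    header: str
    level: int
    start: int  # character offset where the node begins
    end: int    # character offset where the node ends (exclusive)
    children: list = field(default_factory=list)

    @property
    def size(self):
        """Content size in characters."""
        return self.end - self.start


def build_tree(headers, doc_length):
    """Nest (header_text, level, start_offset) tuples given in document order."""
    root = TreeNode("Document", 0, 0, doc_length)
    stack = [root]
    for text, level, start in headers:
        # A new header closes every open node at the same or deeper level.
        while stack[-1].level >= level:
            stack.pop().end = start
        node = TreeNode(text, level, start, doc_length)
        stack[-1].children.append(node)
        stack.append(node)
    return root


HEADERS = [
    ("Part I", 1, 0), ("Chapter 1", 2, 10), ("Chapter 2", 2, 50),
    ("Part II", 1, 90), ("Chapter 3", 2, 100),
]
ROOT = build_tree(HEADERS, 200)
```

Nodes still open at the end of the header list keep `end` equal to the document length, so the last chapter runs to the end of the text.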
|
|
188
86
|
|
|
189
|
-
|
|
87
|
+
### Step 4: Use the Tree
|
|
190
88
|
|
|
191
|
-
|
|
89
|
+
Given a rule that says "check the risk disclosure section":
|
|
192
90
|
|
|
193
|
-
1.
|
|
194
|
-
|
|
195
|
-
|
|
91
|
+
1. **Search the tree** for the relevant node. Match the rule's scope description against node headers.
|
|
92
|
+
- Exact match: "Chapter 5" → find node with "Chapter 5" header.
|
|
93
|
+
- Semantic match: "risk disclosure section" → find node whose header or content relates to risk disclosure. May need fuzzy matching or LLM classification.
|
|
94
|
+
2. **Extract the content** of that node (and optionally its children).
|
|
95
|
+
3. **Check the size.** If the content fits in the worker LLM's context window, use it directly. If not, descend to child nodes and find the specific subsection needed.
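
The search-then-descend steps above can be sketched with plain dicts. The node layout (`header`, `size`, `children`), the keyword-based match, and both function names are assumptions; a semantic match would replace the substring test with fuzzy matching or an LLM classifier.

```python
def find_nodes(node, keyword, path=()):
    """Yield (path, node) for nodes whose header contains the keyword."""
    here = path + (node["header"],)
    if keyword.lower() in node["header"].lower():
        yield here, node
    for child in node.get("children", []):
        yield from find_nodes(child, keyword, here)


def select_content_nodes(node, max_size):
    """Return the node if it fits, otherwise the descendants that do.

    A leaf that is still too large is returned as-is and needs
    further chunking by the caller.
    """
    if node["size"] <= max_size or not node.get("children"):
        return [node]
    selected = []
    for child in node["children"]:
        selected.extend(select_content_nodes(child, max_size))
    return selected


TREE = {"header": "Document", "size": 1000, "children": [
    {"header": "Part III: Disclosure", "size": 600, "children": [
        {"header": "Chapter 5: Risk Disclosure", "size": 600, "children": [
            {"header": "5.1 Credit Risk", "size": 250},
            {"header": "5.2 Market Risk", "size": 350},
        ]},
    ]},
]}
MATCHES = list(find_nodes(TREE, "risk disclosure"))
```

`select_content_nodes` returns every descendant that fits; picking the one subsection a rule actually needs is a further step on top of this.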
|
|
196
96
|
|
|
197
|
-
|
|
198
|
-
- Chapter content + extraction prompt + output format instructions must all fit in the context window.
|
|
199
|
-
- Leave enough room for the model's response (typically 1K-2K tokens).
|
|
200
|
-
- If the chapter is still too large, keep descending in the tree.
|
|
97
|
+
## The Full Context → Chapter → Entity Pipeline
|
|
201
98
|
|
|
202
|
-
|
|
203
|
-
|
|
204
|
-
When designing tree navigation for the worker LLM, budget tokens carefully:
|
|
205
|
-
|
|
206
|
-
```
|
|
207
|
-
Total context window: 32,768 tokens
|
|
208
|
-
- System prompt: ~2,000 tokens
|
|
209
|
-
- Extraction instructions: ~500 tokens
|
|
210
|
-
- Output format instructions: ~300 tokens
|
|
211
|
-
- Examples (optional): ~500 tokens
|
|
212
|
-
- Response reserve: ~2,000 tokens
|
|
213
|
-
---
|
|
214
|
-
Available for document content: ~27,468 tokens (roughly 18,000-20,000 Chinese characters)
|
|
215
|
-
```
|
|
99
|
+
This is the standard narrowing funnel for extracting entities for verification:
|
|
216
100
|
|
|
217
|
-
|
|
101
|
+
1. **Full context**: Use the tree to understand the document structure. Know where everything is.
|
|
102
|
+
2. **Chapter**: Navigate to the specific section that the rule targets. Extract its content.
|
|
103
|
+
3. **Entity**: Within the chapter content, extract the specific entity (number, text, clause) using the techniques from `entity-extraction`.
|
|
218
104
|
|
|
219
|
-
|
|
105
|
+
For worker LLMs with 16K-32K context:
|
|
106
|
+
- The chapter content + the extraction prompt must fit in the context window.
|
|
107
|
+
- If a chapter is too large, descend further in the tree.
|
|
108
|
+
- Always include the parent header chain for context: "Part II > Chapter 3 > Section 3.1" so the LLM knows where this content sits in the document.
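
Prepending the header chain is a small step; a possible sketch, with a hypothetical helper name and bracket format, is:

```python
def with_header_chain(path, content):
    """Prefix section content with its parent header chain."""
    return "[" + " > ".join(path) + "]\n\n" + content


CONTEXT = with_header_chain(
    ("Part II", "Chapter 3", "Section 3.1"),
    "Tier 1 capital consists of ...",
)
```

The chain costs only a few tokens, so it is cheap to include in every worker LLM prompt.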
|
|
220
109
|
|
|
221
|
-
|
|
110
|
+
## Caching and Reuse
|
|
222
111
|
|
|
223
|
-
|
|
224
|
-
-
|
|
225
|
-
-
|
|
112
|
+
Build the tree once per document, reuse across all rules:
|
|
113
|
+
- Save the tree structure as JSON alongside the parsed document.
|
|
114
|
+
- Multiple rules may need different sections of the same document. The tree lets each rule navigate directly to its section without re-parsing.
|
|
226
115
|
|
|
227
|
-
##
|
|
116
|
+
## Edge Cases
|
|
228
117
|
|
|
229
|
-
-
|
|
230
|
-
-
|
|
231
|
-
-
|
|
232
|
-
-
|
|
233
|
-
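- **TOC and body mismatch**: Occasionally the TOC lists chapters that do not exist in the body (or vice versa). Treat the body as authoritative, but record the discrepancy.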
|
|
118
|
+
- **Flat documents**: Some documents have no structural hierarchy. Treat the entire document as one node. Use LLM-guided chunking if it exceeds the context window.
|
|
119
|
+
- **Deeply nested structures**: Some legal documents have 6+ nesting levels. Build all levels but typically only navigate 2-3 levels deep for any given rule.
|
|
120
|
+
- **Cross-section references**: A section might reference "as defined in Section 1.2." When extracting, you may need content from multiple tree nodes. Collect them into a single context for the LLM.
|
|
121
|
+
- **Appendices and annexes**: Often contain critical tables and data. Include them as top-level nodes in the tree.
|