kc-beta 0.2.1 → 0.3.0
- package/package.json +1 -1
- package/src/agent/context.js +8 -4
- package/src/agent/engine.js +65 -9
- package/src/agent/pipelines/initializer.js +53 -8
- package/src/agent/session-state.js +1 -0
- package/src/agent/skill-loader.js +13 -1
- package/src/agent/tools/document-parse.js +104 -21
- package/src/agent/tools/document-search.js +24 -8
- package/src/agent/tools/sandbox-exec.js +16 -5
- package/src/agent/tools/workspace-file.js +47 -20
- package/src/agent/workspace.js +24 -1
- package/src/cli/components.js +8 -1
- package/src/cli/config.js +100 -6
- package/src/cli/index.js +14 -1
- package/src/cli/onboard.js +70 -1
- package/src/config.js +43 -3
- package/src/model-tiers.json +153 -0
- package/src/providers.js +63 -66
- package/template/AGENT.md +20 -0
- package/template/skills/en/meta/compliance-judgment/SKILL.md +10 -42
- package/template/skills/en/meta/document-chunking/SKILL.md +32 -0
- package/template/skills/en/meta/document-parsing/SKILL.md +11 -18
- package/template/skills/en/meta/entity-extraction/SKILL.md +13 -28
- package/template/skills/en/meta/tree-processing/SKILL.md +19 -1
- package/template/skills/en/meta-meta/auto-model-selection/SKILL.md +53 -0
- package/template/skills/en/meta-meta/pdf-review-dashboard/SKILL.md +57 -0
- package/template/skills/en/meta-meta/pdf-review-dashboard/scripts/generate_review.js +262 -0
- package/template/skills/en/meta-meta/rule-extraction/SKILL.md +24 -1
- package/template/skills/en/meta-meta/skill-authoring/SKILL.md +6 -0
- package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +4 -0
- package/template/skills/zh/meta/compliance-judgment/SKILL.md +41 -262
- package/template/skills/zh/meta/document-chunking/SKILL.md +32 -0
- package/template/skills/zh/meta/document-parsing/SKILL.md +65 -132
- package/template/skills/zh/meta/entity-extraction/SKILL.md +68 -230
- package/template/skills/zh/meta/tree-processing/SKILL.md +82 -194
- package/template/skills/zh/meta-meta/auto-model-selection/SKILL.md +51 -0
- package/template/skills/zh/meta-meta/pdf-review-dashboard/SKILL.md +55 -0
- package/template/skills/zh/meta-meta/pdf-review-dashboard/scripts/generate_review.js +262 -0
- package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +79 -164
- package/template/skills/zh/meta-meta/skill-authoring/SKILL.md +64 -185
- package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +95 -216
|
@@ -3,188 +3,33 @@ name: compliance-judgment
|
|
|
3
3
|
description: Determine whether extracted entities comply with verification rules. Use after entity extraction to make the pass/fail judgment for each rule on each document. Covers translating natural language rules into executable logic, choosing between Python calculation and LLM semantic judgment, and producing actionable comments on failures. Also use when designing the judgment step of a workflow or when a rule's judgment logic needs debugging.
|
|
4
4
|
---
|
|
5
5
|
|
|
6
|
-
#
|
|
6
|
+
# Compliance Judgment
|
|
7
7
|
|
|
8
|
-
|
|
8
|
+
Judgment is the moment of truth. You have the extracted entity. You have the rule. Do they comply? The answer must be clear, correct, and — when the answer is no — accompanied by a concise, actionable comment.
|
|
9
9
|
|
|
10
|
-
|
|
10
|
+
## The Judgment Spectrum
|
|
11
11
|
|
|
12
|
-
|
|
12
|
+
Rules range from trivially deterministic to deeply semantic. Pick the right tool for each rule.
|
|
13
13
|
|
|
14
|
-
|
|
14
|
+
**Deterministic** — threshold checks, format validation, date arithmetic, cross-field consistency. Pure Python: free, instant, deterministic.
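
As a rough illustration of a deterministic check, the sketch below compares a percentage against a minimum using `Decimal` to avoid floating-point surprises. The function name `check_minimum` and the comment wording are assumptions made for this example, not part of the package.

```python
from decimal import Decimal


def check_minimum(extracted: str, minimum: str) -> tuple[str, str]:
    """Return (result, comment) for an 'at least' threshold rule.

    Hypothetical helper: `extracted` is a string such as "12.5%",
    `minimum` is the threshold as a string such as "8.0".
    """
    value = Decimal(extracted.rstrip("%"))
    floor = Decimal(minimum)
    if value >= floor:
        # Passing results carry no comment.
        return "pass", ""
    # Failing results state the extracted and expected values, nothing more.
    return "fail", f"Value is {value}%, below the required minimum of {floor}%."


result, comment = check_minimum("7.2%", "8.0")
```

Using `>=` here matches an "at least" rule; a "strictly above" rule would need `>` instead.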
|
|
15
15
|
|
|
16
|
-
|
|
17
|
-
Deterministic (Python) ◄───────────────────────► Semantic (LLM)
|
|
18
|
-
Thresholds  Format validation  Date arithmetic  Cross-field consistency  Adequacy  Completeness  Consistency  Template compliance
|
|
19
|
-
```
|
|
20
|
-
|
|
21
|
-
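Judgments on the left are handled in code: free, instant, deterministic. Judgments on the right need language understanding: they require an LLM, cost money, and carry uncertainty. Most rules sit somewhere in between and call for a hybrid approach.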
|
|
22
|
-
|
|
23
|
-
## Deterministic Judgment (Python)
|
|
24
|
-
|
|
25
|
-
When a rule has clear, computable criteria, implement the judgment logic in Python.
|
|
26
|
-
|
|
27
|
-
### Threshold Checks
|
|
28
|
-
|
|
29
|
-
The most common judgment type in financial regulation.
|
|
30
|
-
|
|
31
|
-
```python
|
|
32
|
-
# Capital adequacy ratio ≥ 8% (CBIRC requirement)
|
|
33
|
-
result = "pass" if extracted_ratio >= 8.0 else "fail"
|
|
34
|
-
comment = f"Capital adequacy ratio is {extracted_ratio}%, below the regulatory minimum of 8.0%" if result == "fail" else ""
|
|
35
|
-
|
|
36
|
-
# Non-performing loan ratio (typically monitored at < 5%, but the threshold varies by institution type)
|
|
37
|
-
result = "pass" if npl_ratio < threshold else "fail"
|
|
38
|
-
|
|
39
|
-
# Provision coverage ratio ≥ 150%
|
|
40
|
-
result = "pass" if provision_coverage >= 150.0 else "fail"
|
|
41
|
-
|
|
42
|
-
# Single-customer loan concentration ≤ 10%
|
|
43
|
-
result = "pass" if single_exposure <= 10.0 else "fail"
|
|
44
|
-
```
|
|
45
|
-
|
|
46
|
-
Notes:
|
|
47
|
-
- Be explicit about boundary values: `>=` or `>`? In regulatory texts, "不低于" ("not less than") maps to `>=`, and "低于" ("less than") maps to `<`.
|
|
48
|
-
- Floating-point precision: use `Decimal` or set a reasonable tolerance (e.g. 0.01%). Financial data is usually precise to two decimal places.
|
|
49
|
-
|
|
50
|
-
### Format Validation
|
|
51
|
-
|
|
52
|
-
```python
|
|
53
|
-
import re
|
|
54
|
-
|
|
55
|
-
# Loan number format: XX-YYYY-ZZZZZZ
|
|
56
|
-
result = "pass" if re.match(r"[A-Z]{2}-\d{4}-\d{6}", loan_number) else "fail"
|
|
57
|
-
|
|
58
|
-
# Unified Social Credit Code: 18 characters
|
|
59
|
-
result = "pass" if re.match(r"^[0-9A-Z]{18}$", uscc) else "fail"
|
|
60
|
-
|
|
61
|
-
# Mobile phone number format
|
|
62
|
-
result = "pass" if re.match(r"^1[3-9]\d{9}$", phone) else "fail"
|
|
63
|
-
```
|
|
64
|
-
|
|
65
|
-
### Date Arithmetic
|
|
66
|
-
|
|
67
|
-
```python
|
|
68
|
-
from datetime import datetime, timedelta
|
|
69
|
-
|
|
70
|
-
# Contract signing date within 30 days of the application date
|
|
71
|
-
sign_date = datetime.strptime(extracted_sign_date, "%Y-%m-%d")
|
|
72
|
-
app_date = datetime.strptime(extracted_app_date, "%Y-%m-%d")
|
|
73
|
-
result = "pass" if (sign_date - app_date).days <= 30 else "fail"
|
|
74
|
-
comment = f"Signing date {extracted_sign_date} is {(sign_date - app_date).days} days after application date {extracted_app_date}, exceeding the 30-day limit" if result == "fail" else ""
|
|
75
|
-
|
|
76
|
-
# Loan maturity date no earlier than the contracted date
|
|
77
|
-
result = "pass" if actual_maturity >= contracted_maturity else "fail"
|
|
78
|
-
|
|
79
|
-
# Report issued within 4 months after the reporting period end (annual report requirement)
|
|
80
|
-
report_date = datetime.strptime(extracted_report_date, "%Y-%m-%d")
|
|
81
|
-
period_end = datetime.strptime(extracted_period_end, "%Y-%m-%d")
|
|
82
|
-
deadline = period_end + timedelta(days=120)  # roughly 4 months
|
|
83
|
-
result = "pass" if report_date <= deadline else "fail"
|
|
84
|
-
```
|
|
85
|
-
|
|
86
|
-
### Cross-Field Consistency Checks
|
|
87
|
-
|
|
88
|
-
```python
|
|
89
|
-
# Total equals the sum of line items
|
|
90
|
-
result = "pass" if abs(total - sum(items)) < 0.01 else "fail"
|
|
91
|
-
comment = f"Total {total} does not match the sum of line items {sum(items)}; difference {total - sum(items)}" if result == "fail" else ""
|
|
92
|
-
|
|
93
|
-
# Balance sheet balances: assets = liabilities + owners' equity
|
|
94
|
-
result = "pass" if abs(assets - liabilities - equity) < 0.01 else "fail"
|
|
95
|
-
|
|
96
|
-
# The same metric has the same value across sections
|
|
97
|
-
result = "pass" if value_in_summary == value_in_detail else "fail"
|
|
98
|
-
comment = f"Summary shows {value_in_summary}, detail shows {value_in_detail}" if result == "fail" else ""
|
|
99
|
-
```
|
|
100
|
-
|
|
101
|
-
Deterministic judgment is the first choice: free, instant, reproducible. If Python can make the judgment, never call an LLM.
|
|
102
|
-
|
|
103
|
-
## Semantic Judgment (LLM)
|
|
104
|
-
|
|
105
|
-
Use an LLM when the rule requires language understanding.
|
|
106
|
-
|
|
107
|
-
### Adequacy Judgment
|
|
108
|
-
|
|
109
|
-
"Does the risk disclosure adequately describe the key risk factors?"
|
|
110
|
-
|
|
111
|
-
Python cannot judge this: "adequate" is a semantic concept that requires understanding the content.
|
|
112
|
-
|
|
113
|
-
Key points for designing the LLM judgment prompt:
|
|
114
|
-
1. Provide the full rule text (what constitutes compliance).
|
|
115
|
-
2. Provide the extracted document content (what the document actually says).
|
|
116
|
-
3. Require structured output: pass/fail, reasoning, comment.
|
|
117
|
-
4. Require conservative judgment: return fail only for clear non-compliance. Use uncertain for genuinely ambiguous cases.
|
|
118
|
-
|
|
119
|
-
### Completeness Judgment
|
|
120
|
-
|
|
121
|
-
"Does the management discussion and analysis cover financial position, operating results, and cash flows?"
|
|
122
|
-
|
|
123
|
-
This is a checklist-style semantic judgment: does the content cover each of the required topics?
|
|
124
|
-
|
|
125
|
-
```
|
|
126
|
-
Determine whether the following management discussion and analysis covers these three required topics:
|
|
127
|
-
1. Financial position analysis
|
|
128
|
-
2. Operating results analysis
|
|
129
|
-
3. Cash flow analysis
|
|
130
|
-
|
|
131
|
-
Document content:
|
|
132
|
-
{extracted_section}
|
|
133
|
-
|
|
134
|
-
For each topic, determine whether there is substantive discussion (not just a heading mention).
|
|
135
|
-
Return JSON:
|
|
136
|
-
{
|
|
137
|
-
"topic_1_covered": true/false,
|
|
138
|
-
"topic_2_covered": true/false,
|
|
139
|
-
"topic_3_covered": true/false,
|
|
140
|
-
"overall": "pass/fail",
|
|
141
|
-
"comment": "..."
|
|
142
|
-
}
|
|
143
|
-
```
|
|
144
|
-
|
|
145
|
-
### Consistency Judgment
|
|
146
|
-
|
|
147
|
-
"Is the executive summary consistent with the detailed findings?"
|
|
148
|
-
|
|
149
|
-
Requires a semantic comparison of the two passages, checking for contradictions or omissions.
|
|
150
|
-
|
|
151
|
-
### Template Compliance Judgment
|
|
16
|
+
**Semantic** — adequacy, completeness, consistency, compliance with templates, detecting misleading or suggestive language, assessing whether a description is fair and balanced. These require language understanding — use worker LLM.
|
|
152
17
|
|
|
153
|
-
"
|
|
18
|
+
Many real compliance rules require semantic judgment. "The risk disclosure must adequately describe the key risks" cannot be checked with regex or Python. "The contract description must not be misleading or suggestive" requires deep language understanding. Use worker LLM for these without hesitation.
|
|
154
19
|
|
|
155
|
-
|
|
20
|
+
Some rules combine both: extract a number (deterministic), compare to threshold (deterministic), then assess the explanation if borderline (semantic). The mix depends on the rule.
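
One way to picture such a combined rule is the sketch below: a regex pulls out the number, Python compares it to the threshold, and only a borderline pass is handed to a semantic reviewer. The names `judge_ratio`, `margin`, and the `review` callable (standing in for a worker-LLM call) are hypothetical.

```python
import re
from decimal import Decimal
from typing import Callable, Optional


def judge_ratio(
    text: str,
    minimum: Decimal,
    margin: Decimal,
    review: Optional[Callable[[str], str]] = None,
) -> str:
    """Extract a percentage, compare it to a minimum, review borderline passes."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", text)
    if match is None:
        # Nothing to judge: this is an extraction gap, not a failure.
        return "missing"
    value = Decimal(match.group(1))
    if value < minimum:
        return "fail"
    if value - minimum <= margin and review is not None:
        # Borderline pass: defer to the (hypothetical) semantic reviewer.
        return review(text)
    return "pass"
```

Most documents would exit at the deterministic comparison; the reviewer is only consulted inside the margin.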
|
|
156
21
|
|
|
157
|
-
|
|
22
|
+
The right method is whatever achieves accuracy at lowest cost. Simple threshold checks don't need LLM. Semantic assessments don't benefit from Python. Most projects will have a mix — let the nature of each rule determine the method.
|
|
158
23
|
|
|
159
|
-
|
|
24
|
+
## Output Format
|
|
160
25
|
|
|
161
|
-
|
|
162
|
-
|
|
163
|
-
```
|
|
164
|
-
Step 1 (regex extraction): extract the value for "capital adequacy ratio" → 12.5%
|
|
165
|
-
Step 2 (Python judgment): 12.5% >= 8.0% → pass
|
|
166
|
-
```
|
|
167
|
-
If step 1 extraction fails or has low confidence:
|
|
168
|
-
```
|
|
169
|
-
Step 3 (LLM extraction): find the latest capital adequacy ratio in the following content → 12.50%
|
|
170
|
-
Step 4 (Python judgment): 12.50% >= 8.0% → pass
|
|
171
|
-
```
|
|
172
|
-
If the value is near the boundary (e.g. 8.02%):
|
|
173
|
-
```
|
|
174
|
-
Step 5 (LLM review): confirm whether 12.50% is the final adjusted capital adequacy ratio rather than an intermediate calculation
|
|
175
|
-
```
|
|
176
|
-
|
|
177
|
-
This funnel ensures that 90% of documents finish at step 2 (zero LLM cost); only hard cases invoke the LLM.
|
|
178
|
-
|
|
179
|
-
## Output Format
|
|
180
|
-
|
|
181
|
-
Judgment result for each rule on each document:
|
|
26
|
+
For each rule × document combination:
|
|
182
27
|
|
|
183
28
|
```json
|
|
184
29
|
{
|
|
185
30
|
"rule_id": "R001",
|
|
186
|
-
"document": "
|
|
187
|
-
"result": "pass",
|
|
31
|
+
"document": "report_2024_q1.pdf",
|
|
32
|
+
"result": "pass | fail | missing | error | uncertain",
|
|
188
33
|
"extracted_value": "12.5%",
|
|
189
34
|
"expected": ">= 8.0%",
|
|
190
35
|
"comment": "",
|
|
@@ -192,112 +37,46 @@ LLM 判定提示词设计要点:
|
|
|
192
37
|
}
|
|
193
38
|
```
|
|
194
39
|
|
|
195
|
-
|
|
40
|
+
**Result values:**
|
|
41
|
+
- **pass**: Entity complies with the rule.
|
|
42
|
+
- **fail**: Entity does not comply. Comment is required.
|
|
43
|
+
- **missing**: The entity could not be found in the document. This is different from fail — the information is absent, not non-compliant.
|
|
44
|
+
- **error**: Something went wrong during extraction or judgment (parsing failure, API error). Needs investigation.
|
|
45
|
+
- **uncertain**: The judgment is ambiguous. May need human review.
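
The result values above can be enforced in code. The sketch below is a hypothetical `Judgment` record that models only the fields visible in the JSON example and rejects unknown result values or a `fail` without a comment.

```python
from dataclasses import dataclass

VALID_RESULTS = {"pass", "fail", "missing", "error", "uncertain"}


@dataclass
class Judgment:
    """One rule x document result (illustrative; field set is an assumption)."""

    rule_id: str
    document: str
    result: str
    extracted_value: str = ""
    expected: str = ""
    comment: str = ""

    def __post_init__(self) -> None:
        if self.result not in VALID_RESULTS:
            raise ValueError(f"unknown result: {self.result!r}")
        if self.result == "fail" and not self.comment:
            # A fail must always explain itself.
            raise ValueError("a fail result requires a comment")


ok = Judgment("R001", "report_2024_q1.pdf", "pass", "12.5%", ">= 8.0%")
```

Validating at construction time keeps malformed results from reaching downstream reports.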
|
|
196
46
|
|
|
197
|
-
|
|
198
|
-
|---|------|---------|
|
|
199
|
-
| **pass** | Entity complies | Usually no comment needed |
|
|
200
|
-
| **fail** | Entity does not comply | Comment is **required** |
|
|
201
|
-
| **missing** | Entity not found in the document | Note the search scope |
|
|
202
|
-
| **error** | Extraction or judgment failed | Note the error type |
|
|
203
|
-
| **uncertain** | Judgment is ambiguous; needs human review | Explain the source of uncertainty |
|
|
47
|
+
**Design exit criteria first:** Before writing judgment logic for a rule, define the exit conditions: what constitutes pass, what constitutes fail, what triggers escalation to human, how to handle empty/missing values, what value ranges are valid. Explicit exit criteria prevent ambiguous or inconsistent judgment.
|
|
204
48
|
|
|
205
|
-
**
|
|
49
|
+
**Prompt design:** Design prompts for what you want, not against what you don't want. "Don't include reasoning" is less reliable than extracting the verdict from structured output in postprocessing. Use output filtering instead of prompt negation.
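
A minimal sketch of that output filtering, assuming the model was asked to reply with a JSON object containing a `result` key (the key name and the function `extract_verdict` are assumptions): any reasoning around or inside the object is simply ignored in postprocessing.

```python
import json


def extract_verdict(raw_output: str) -> str:
    """Pull only the verdict out of a model reply that may also carry reasoning."""
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end <= start:
        return "error"
    try:
        payload = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return "error"
    verdict = str(payload.get("result", "")).lower()
    # Anything outside the expected vocabulary is treated as an error.
    return verdict if verdict in {"pass", "fail", "uncertain"} else "error"
```

The prompt stays positive ("return JSON with a result field"), and the filter, not the prompt, guarantees that stray text never reaches the stored verdict.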
|
|
206
50
|
|
|
207
|
-
|
|
51
|
+
**Comments:**
|
|
52
|
+
- Required only when result is `fail`. Skip for `pass` unless the developer user specifically requests pass comments.
|
|
53
|
+
- Be concise and factual: "Capital adequacy ratio is 7.2%, below the regulatory minimum of 8.0%."
|
|
54
|
+
- Do not editorialize: not "This is a serious violation that could result in penalties." Just state the facts.
|
|
55
|
+
- Include the extracted value and the expected value/condition for context.
|
|
208
56
|
|
|
209
|
-
|
|
57
|
+
### Lightweight Annotation Markup
|
|
210
58
|
|
|
211
|
-
|
|
212
|
-
|
|
213
|
-
```
|
|
214
|
-
"Capital adequacy ratio is 7.2%, below the regulatory minimum of 8.0%."
|
|
215
|
-
"Loan contract signing date 2024-05-15 is 75 days after application date 2024-03-01, exceeding the 30-day limit."
|
|
216
|
-
"Balance sheet does not balance: total assets are 1,234,567 (in units of CNY 10,000), liabilities plus owners' equity are 1,234,590, a difference of 23."
|
|
217
|
-
"No dedicated discussion of liquidity risk was found in the risk management chapter."
|
|
218
|
-
```
|
|
219
|
-
|
|
220
|
-
### Bad Comments
|
|
221
|
-
|
|
222
|
-
```
|
|
223
|
-
"Non-compliant." ← no specifics
|
|
224
|
-
"Capital adequacy ratio is below standard and poses a major risk." ← adds subjective judgment
|
|
225
|
-
"The bank's capital adequacy ratio is 7.2%; according to the CBIRC's 2023 ……(lengthy exposition)" ← too verbose
|
|
226
|
-
```
|
|
227
|
-
|
|
228
|
-
### Comment Principles
|
|
229
|
-
|
|
230
|
-
- **Concise and factual**: extracted value + expected value + difference, in three sentences or fewer.
|
|
231
|
-
- **Only on fail**: pass results need no comment unless the developer user explicitly asks for one.
|
|
232
|
-
- **No subjective judgment**: avoid "serious", "major", "worrying". State only the facts.
|
|
233
|
-
- **Include key values**: so reviewers can understand the issue without going back to the source.
|
|
234
|
-
|
|
235
|
-
### Lightweight Annotation Markup
|
|
236
|
-
|
|
237
|
-
For easier human review, lower token cost, and diffing between verification runs, results can also be expressed in a compact text markup:
|
|
59
|
+
For human review, token-efficient logging, and clean diff comparisons, results can also be expressed in compact text markup:
|
|
238
60
|
|
|
239
61
|
```
|
|
240
62
|
[PASS] capital_adequacy <- 12.5% (>= 8.0%) | conf:0.95 | src:p3-s2
|
|
241
|
-
[FAIL] sign_date_gap <- 75d (<= 30d) | conf:0.90 | src:p1-s4 | note
|
|
242
|
-
[MISSING] collateral_value | conf:0.60 | note
|
|
243
|
-
```
|
|
244
|
-
|
|
245
|
-
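This format is losslessly convertible to and from the JSON format above. Use it when presenting results to the developer user for quick review, when logging to evolution iteration summaries to save tokens, and when computing diffs between verification runs. See `references/output-format.md` for the full format specification and conversion rules.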
|
|
246
|
-
|
|
247
|
-
## Judgment Ordering
|
|
248
|
-
|
|
249
|
-
Some rules depend on one another:
|
|
250
|
-
|
|
251
|
-
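- **Conditional dependency**: Rule B applies only when Rule A passes. "If the borrower is a new customer (Rule A), additional due-diligence documents are required (Rule B)."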
|
|
252
|
-
- **Value dependency**: Rule C uses a value computed by Rule A. "The risk-weighted capital ratio (Rule A) determines the required provision level (Rule C)."
|
|
253
|
-
- **Logical dependency**: Rule D needs checking only when Rules A and B both fail.
|
|
254
|
-
|
|
255
|
-
Mark these dependencies in the rule catalog. Execute rules in dependency order. Pass upstream results to downstream rules as context.
|
|
256
|
-
|
|
257
|
-
### Dependency Graph Example
|
|
258
|
-
|
|
259
|
-
```
|
|
260
|
-
R001 (capital adequacy ratio extraction) → R002 (capital adequacy ratio threshold judgment)
|
|
261
|
-
R003 (core tier 1 capital extraction) → R004 (core tier 1 capital adequacy ratio calculation) → R002
|
|
262
|
-
R001 + R005 (leverage ratio) → R006 (overall rating)
|
|
63
|
+
[FAIL] sign_date_gap <- 75d (<= 30d) | conf:0.90 | src:p1-s4 | note:Signing overdue by 45 days
|
|
64
|
+
[MISSING] collateral_value | conf:0.60 | note:Collateral valuation not found in document
|
|
263
65
|
```
|
|
264
66
|
|
|
265
|
-
|
|
266
|
-
|
|
267
|
-
## Handling Edge Cases
|
|
268
|
-
|
|
269
|
-
### Null Extraction
|
|
270
|
-
|
|
271
|
-
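The entity was not found. Default to **missing**, not fail. A missing value is an extraction problem, not a compliance problem. Feed it back to the parsing and extraction steps; the parser may need upgrading or the extraction strategy adjusting.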
|
|
272
|
-
|
|
273
|
-
### Conflicting Multiple Values
|
|
274
|
-
|
|
275
|
-
The same entity appears in several places in the document with inconsistent values.
|
|
276
|
-
|
|
277
|
-
- Flag as **uncertain**.
|
|
278
|
-
- List every value found and its source location in the comment.
|
|
279
|
-
- If the rule names a preferred source (e.g. "the audit report figure prevails"), use the value from that source.
|
|
280
|
-
|
|
281
|
-
### Conditional Rules
|
|
282
|
-
|
|
283
|
-
"If the loan amount exceeds 10 million yuan, collateral is required."
|
|
284
|
-
|
|
285
|
-
- Check the condition first: does the loan amount exceed 10 million?
|
|
286
|
-
- Condition not met → rule does not apply → result is pass (or not_applicable).
|
|
287
|
-
- Condition met → continue checking the subsequent requirements.
|
|
67
|
+
This format is losslessly convertible to and from the JSON format above. Use it when presenting results to the developer user for quick review, logging to evolution iteration summaries where token economy matters, or computing diffs between verification runs. See `references/output-format.md` for the full specification and conversion rules.
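
The round trip can be sketched as follows. The authoritative grammar lives in `references/output-format.md`, which is not shown here, so this parser is inferred only from the three example lines above; the dictionary keys (`name`, `confidence`, `source`) and both function names are assumptions.

```python
import re

# Markup tag -> record key (key names are assumptions for this sketch).
_TAGS = {"conf": "confidence", "src": "source", "note": "comment"}
_HEAD = re.compile(r"^\[(\w+)\] (\S+)(?: <- (.+?) \((.+)\))?$")


def from_markup(line: str) -> dict:
    """Parse one compact markup line into a flat record."""
    head, *fields = [part.strip() for part in line.split(" | ")]
    m = _HEAD.match(head)
    if m is None:
        raise ValueError(f"unrecognized markup: {line!r}")
    record = {"result": m.group(1).lower(), "name": m.group(2)}
    if m.group(3) is not None:
        record["extracted_value"] = m.group(3)
        record["expected"] = m.group(4)
    for field in fields:
        tag, _, value = field.partition(":")
        record[_TAGS[tag]] = value  # values stay strings to keep "0.90" intact
    return record


def to_markup(record: dict) -> str:
    """Render a record back into the compact markup line."""
    head = f"[{record['result'].upper()}] {record['name']}"
    if "extracted_value" in record:
        head += f" <- {record['extracted_value']} ({record['expected']})"
    tail = [f"{tag}:{record[key]}" for tag, key in _TAGS.items() if key in record]
    return " | ".join([head, *tail])
```

Keeping every value as a string is what makes the round trip lossless here; a note that itself contains ` | ` would need escaping, which this sketch does not attempt.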
|
|
288
68
|
|
|
289
|
-
|
|
69
|
+
## Judgment Ordering
|
|
290
70
|
|
|
291
|
-
|
|
71
|
+
Some rules depend on the results of other rules:
|
|
72
|
+
- Rule B might only apply if Rule A passes. "If the borrower is a new customer (Rule A), then additional documentation is required (Rule B)."
|
|
73
|
+
- Rule C might use a value computed by Rule A. "The risk-weighted capital ratio (Rule A) determines the required reserve level (Rule C)."
|
|
292
74
|
|
|
293
|
-
|
|
294
|
-
- Search the document for the keywords "关联方" (related party) + "担保" (guarantee) + "承诺" (commitment).
|
|
295
|
-
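- If there is a match, extract the context and send it to the LLM to confirm whether it is an actual guarantee commitment (it may merely state that no guarantee was provided).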
|
|
296
|
-
- If no match is found, return pass with reduced confidence (the search may be incomplete).
|
|
75
|
+
Map these dependencies in the rule catalog. Execute rules in dependency order. Pass upstream results as context to downstream rules.
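
Dependency-ordered execution can be sketched with the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The `run_in_order` function and the `judge(rule_id, upstream)` callback signature are hypothetical; `dependencies` maps each rule to the set of rules it depends on.

```python
from graphlib import TopologicalSorter


def run_in_order(dependencies: dict, judge) -> dict:
    """Run rules so that every rule sees the results of its upstream rules."""
    results = {}
    # static_order() yields each rule only after all of its predecessors.
    for rule_id in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(rule_id, ())}
        results[rule_id] = judge(rule_id, upstream)
    return results
```

A cyclic rule catalog makes `TopologicalSorter` raise `graphlib.CycleError`, which is a useful early signal that the dependency map is wrong.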
|
|
297
76
|
|
|
298
|
-
|
|
77
|
+
## Handling Edge Cases
|
|
299
78
|
|
|
300
|
-
|
|
301
|
-
-
|
|
302
|
-
-
|
|
303
|
-
-
|
|
79
|
+
- **Null extraction**: The entity was not found. Default to `missing`, not `fail`. A missing value is an extraction problem, not a compliance problem.
|
|
80
|
+
- **Multiple values**: The document contains the entity in multiple places with different values. Flag as `uncertain`. Report all found values.
|
|
81
|
+
- **Conditional rules**: "If the loan exceeds 1M, then collateral is required." Check the condition before applying the rule. If the condition is not met, the rule does not apply — result is `pass` (or `not_applicable` if you add that category).
|
|
82
|
+
- **Negative results**: Some rules check for absence. "The document must NOT contain guarantees to related parties." Searching for absence is harder than searching for presence. Be thorough in the search, then be confident in the negative.
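
The first three edge cases above can be combined in one small function. This is only a sketch for the collateral example: the function name, its parameters, and the use of a separate `not_applicable` value are assumptions.

```python
def judge_collateral_rule(
    loan_amounts: list[float],
    has_collateral: bool,
    limit: float = 1_000_000,
) -> str:
    """Apply 'if the loan exceeds the limit, collateral is required'."""
    if not loan_amounts:
        # Null extraction: an extraction problem, not a compliance problem.
        return "missing"
    if len(set(loan_amounts)) > 1:
        # Conflicting values found in the document: leave it to a human.
        return "uncertain"
    if loan_amounts[0] <= limit:
        # Condition not met, so the rule does not apply.
        return "not_applicable"
    return "pass" if has_collateral else "fail"
```

Checking for missing and conflicting values before the condition keeps a bad extraction from being reported as a compliance failure.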
|
|
@@ -0,0 +1,32 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: document-chunking
|
|
3
|
+
description: >
|
|
4
|
+
Fast, cheap chunking for processing batches of sample and input documents.
|
|
5
|
+
Use when you need to split documents into manageable pieces for initial observation,
|
|
6
|
+
data sensibility checks, or feeding to extraction workflows. Not for production
|
|
7
|
+
verification chunking — for that, use tree-processing to design a tailored chunking script.
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
# Document Chunking
|
|
11
|
+
|
|
12
|
+
Split documents into pieces for downstream processing. This is the fast, cheap version — for batch processing of samples and inputs, not for precision verification workflows.
|
|
13
|
+
|
|
14
|
+
## Methods
|
|
15
|
+
|
|
16
|
+
**Page-level splits** — simplest. Each page is a chunk. Works for most document processing where you need to iterate over content.
|
|
17
|
+
|
|
18
|
+
**Fixed-size chunks** — split by character/token count with overlap. Good for search and initial observation. Typical: 2000-4000 chars with 200 char overlap.
|
|
19
|
+
|
|
20
|
+
**Header-based splits** — detect section headers and split at boundaries. Preserves semantic units. Use regex patterns for the document's header convention.
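
The fixed-size and header-based methods can be sketched in a few lines each. The function names and default values are assumptions (the defaults fall inside the typical range quoted above), and the header pattern assumes markdown-style `#` headings.

```python
import re


def fixed_size_chunks(text: str, size: int = 3000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop early enough that no trailing chunk lies entirely inside the overlap.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def header_chunks(text: str, header_pattern: str = r"^#{1,6} ") -> list[str]:
    """Split text at lines matching `header_pattern`, keeping each header with its body."""
    starts = [m.start() for m in re.finditer(header_pattern, text, flags=re.MULTILINE)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first header
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]
```

For documents with a different header convention (numbered chapters, for instance), only `header_pattern` needs to change.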
|
|
21
|
+
|
|
22
|
+
## When to Use What
|
|
23
|
+
|
|
24
|
+
Pick the simplest method that serves the task:
|
|
25
|
+
- Batch document observation → page-level
|
|
26
|
+
- Full-text search index → fixed-size with overlap
|
|
27
|
+
- Section-level extraction → header-based
|
|
28
|
+
- Table of contents available → parse TOC for structure
|
|
29
|
+
|
|
30
|
+
## Relationship to tree-processing
|
|
31
|
+
|
|
32
|
+
This skill is for quick, cheap chunking during exploration and batch processing. When you need production-grade chunking for verification workflows — where the chunking mechanism must be precise, consistent, and coded as a script — use `tree-processing` instead.
|
|
@@ -3,166 +3,99 @@ name: document-parsing
|
|
|
3
3
|
description: Parse source documents into machine-readable text with maximum fidelity. Use when processing any document in Samples/ or Input/ for the first time, when parsed text quality is poor, or when tables and charts need special handling. Covers multi-level parser selection from simple text extraction to OCR and vision models. Also use when a verification rule fails due to parsing issues (garbled text, missing tables, mangled layouts) and the parser needs to be upgraded for that document type.
|
|
4
4
|
---
|
|
5
5
|
|
|
6
|
-
#
|
|
6
|
+
# Document Parsing
|
|
7
7
|
|
|
8
|
-
|
|
8
|
+
Parsing is the foundation. If the text is wrong, everything downstream is wrong. But parsing is also a cost center — do not use expensive vision models when simple text extraction works.
|
|
9
9
|
|
|
10
|
-
##
|
|
10
|
+
## The Minimum Viable Parser Principle
|
|
11
11
|
|
|
12
|
-
|
|
12
|
+
Start with the simplest parser. Escalate only when necessary. This is not about saving money — it is about producing the most reliable output. Simple parsers have fewer failure modes.
|
|
13
13
|
|
|
14
|
-
|
|
14
|
+
### Level 1: Direct Text Extraction
|
|
15
|
+
- Tool: pdfjs-dist or similar PDF text extraction.
|
|
16
|
+
- When: Well-formed digital PDFs with embedded text. This covers most modern business documents.
|
|
17
|
+
- Output: Raw text with basic structure preserved (paragraphs, basic formatting).
|
|
18
|
+
- Limitations: Tables may come out as messy text. Charts and images are invisible. Scanned PDFs produce nothing.
|
|
15
19
|
|
|
16
|
-
### Level
|
|
20
|
+
### Level 2: Provider VLM (Vision Language Model)
|
|
21
|
+
- Tool: VLM models from configured provider (VLM_TIER3 for cheap OCR, VLM_TIER1 for complex interpretation).
|
|
22
|
+
- When: Level 1 produces garbled/incomplete text, scanned PDFs, image-based PDFs.
|
|
23
|
+
- Output: Recognized text from page images, or structured interpretation (table as markdown, chart data as JSON).
|
|
24
|
+
- Calling a provider VLM is more convenient and reliable than deploying local OCR. Use the cheapest VLM tier first; escalate to a more capable tier for complex tables/charts.
|
|
17
25
|
|
|
18
|
-
|
|
19
|
-
-
|
|
20
|
-
-
|
|
21
|
-
-
|
|
22
|
-
- **Cost**: zero API calls, millisecond-level speed.
|
|
26
|
+
### Level 3: MineRU API or Local Tools (Optional)
|
|
27
|
+
- Tool: MineRU API, pdfplumber, or locally deployed OCR — if configured.
|
|
28
|
+
- When: Provider VLM is unavailable or too expensive for batch processing.
|
|
29
|
+
- These are optional fallbacks. Most users will use Level 1 + Level 2.
|
|
23
30
|
|
|
24
|
-
|
|
31
|
+
## Quality Detection
|
|
25
32
|
|
|
26
|
-
|
|
33
|
+
How to know when to escalate:
|
|
27
34
|
|
|
28
|
-
-
|
|
29
|
-
-
|
|
30
|
-
-
|
|
31
|
-
-
|
|
32
|
-
-
|
|
35
|
+
- **Low character count**: The document has pages but extracted text is very short. Likely a scanned PDF.
|
|
36
|
+
- **Garbled text**: Unusual character sequences, encoding errors, or meaningless text patterns.
|
|
37
|
+
- **Missing expected sections**: The table of contents mentions Chapter 5 but no Chapter 5 text was extracted.
|
|
38
|
+
- **Table artifacts**: Columns of numbers without alignment, cell content mixed with headers, or table borders appearing as characters.
|
|
39
|
+
- **Missing numbers in financial tables**: If a financial document's key metrics are not in the extracted text, the tables were probably not parsed.
|
|
33
40
|
|
|
34
|
-
|
|
41
|
+
Write a quick quality check after parsing and before proceeding. If quality is insufficient, escalate to the next parser level.
|
|
35
42
|
|
|
36
|
-
|
|
37
|
-
- **When**: scanned PDFs, photocopied regulatory documents, historical archives (many bank documents from before 2010 are scans).
|
|
38
|
-
- **Output**: text recognized from images.
|
|
39
|
-
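- **Limitations**: slow, consumes API calls, and may introduce recognition errors (Traditional/Simplified Chinese confusion, interference from table rules, etc.).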
|
|
40
|
-
- **Caveats**: OCR handles vertical Chinese text, areas covered by seals, and handwritten annotations poorly. Run extra quality checks in these cases.
|
|
43
|
+
### Parse Quality Score
|
|
41
44
|
|
|
42
|
-
|
|
45
|
+
Compute a quality score (0.0 to 1.0) from weighted heuristics to make escalation decisions systematic rather than ad-hoc. A recommended starting framework:
|
|
43
46
|
|
|
44
|
-
-
|
|
45
|
-
-
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
- Mixed layout: pages where text and images are interleaved.
|
|
49
|
-
- **Output**: structured interpretation of visual content (tables as markdown, chart data as JSON).
|
|
50
|
-
- **Limitations**: expensive and slow. Use only when the visual content truly requires semantic understanding.
|
|
47
|
+
- **Character density** (weight ~0.3): actual character count / expected characters for the document's page count. A 10-page PDF that yields only 200 characters likely failed.
|
|
48
|
+
- **Garble ratio** (weight ~0.2): share of characters that are common CJK/Latin text rather than control characters, unusual sequences, or encoding artifacts (a higher share means cleaner text).
|
|
49
|
+
- **Section completeness** (weight ~0.3): if the document has a table of contents, what fraction of TOC entries have matching content in the extracted text?
|
|
50
|
+
- **Table integrity** (weight ~0.2): for financial documents, are key numeric values that should appear in tables actually present in the extracted text?
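
A weighted score along these lines might look like the sketch below. The signal names, the `expected_per_page` default of 1500 characters, and both function names are assumptions for illustration; each signal is expected to be normalized to the 0.0 to 1.0 range, with higher meaning better.

```python
# Weights follow the recommended starting values; key names are assumptions.
WEIGHTS = {
    "char_density": 0.3,
    "clean_ratio": 0.2,           # share of non-garbled characters
    "section_completeness": 0.3,
    "table_integrity": 0.2,
}


def char_density(text: str, pages: int, expected_per_page: int = 1500) -> float:
    """Actual characters divided by a rough expectation for the page count."""
    if pages <= 0:
        return 0.0
    return min(len(text) / (pages * expected_per_page), 1.0)


def parse_quality_score(signals: dict) -> float:
    """Combine per-heuristic signals into a single score between 0.0 and 1.0."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(signals.get(name, 0.0), 0.0), 1.0)  # clamp each signal
        total += weight * value
    return round(total, 4)
```

A missing signal counts as 0.0 here, which errs on the side of escalating rather than accepting an unverified parse.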
|
|
51
51
|
|
|
52
|
-
|
|
52
|
+
**Escalation thresholds** (recommended defaults — adjust freely):
|
|
53
|
+
- Score >= 0.7: accept this parser level, proceed to downstream processing.
|
|
54
|
+
- Score 0.4-0.7: escalate to the next parser level, re-parse, re-score.
|
|
55
|
+
- Score < 0.4: skip the intermediate step and go directly to a vision-capable level (the provider VLM of Level 2, or a Level 3 tool if configured), depending on document characteristics.
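
The thresholds can drive a simple escalation loop, sketched below. `parsers` is a hypothetical list of zero-argument callables ordered from cheapest to most capable, `score_fn` is any function that returns a 0.0 to 1.0 quality score, and jumping two positions on a very low score is just one possible reading of "skip".

```python
def escalation_decision(score: float, accept_at: float = 0.7, skip_below: float = 0.4) -> str:
    """Map a quality score to accept / escalate / skip_ahead."""
    if score >= accept_at:
        return "accept"
    if score >= skip_below:
        return "escalate"
    return "skip_ahead"


def parse_with_escalation(parsers: list, score_fn) -> tuple:
    """Try parsers in order until one is accepted or the list is exhausted."""
    level = 0
    last = len(parsers) - 1
    while True:
        text = parsers[level]()
        decision = escalation_decision(score_fn(text))
        if decision == "accept" or level == last:
            # Return the index of the parser used, for lock-in bookkeeping.
            return level, text
        jump = 2 if decision == "skip_ahead" else 1
        level = min(level + jump, last)
```

The returned level is what the lock-in step would record for the document type.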
|
|
53
56
|
|
|
54
|
-
|
|
57
|
+
**Lock-in**: once a parser level produces an acceptable score for a document type, record that level. Do not re-evaluate unless a downstream verification failure is traced back to a parsing issue.
|
|
55
58
|
|
|
56
|
-
|
|
59
|
+
These weights, thresholds, and the scoring approach itself are starting points. The coding agent should design whatever quality assessment works for the specific document types at hand — a simple pass/fail heuristic may be sufficient for some scenarios; a more nuanced scoring function may be needed for others. The important pattern is: **measure quality → compare to threshold → decide whether to escalate**.
|
|
57
60
|
|
|
58
|
-
-
|
|
59
|
-
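- **Garbled text detection**: long runs of uncommon characters, encoding-error symbols (□, ■, ?), or meaningless character sequences. Common in PDFs with encoding mismatches or broken font embedding.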
|
|
60
|
-
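- **Missing sections**: the table of contents lists "第五章 风险管理" (Chapter 5, Risk Management), but no matching content appears in the extracted text. The chapter may be a scanned insert or an image.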
|
|
61
|
-
- **Table anomalies**:
|
|
62
|
-
- Numeric columns are misaligned; values cannot be matched to headers.
|
|
63
|
-
- Cell content is mixed with adjacent cells.
|
|
64
|
-
- Table-rule characters (|, +, -) appear in the text.
|
|
65
|
-
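- Key financial data is missing (figures such as capital adequacy ratio, non-performing loan ratio, or net profit cannot be found in the text).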
|
|
66
|
-
- **Page-number gaps**: a jump in otherwise consecutive page numbers suggests some pages were not extracted.
|
|
61
|
+
This follows the same tier-transition pattern as model tier selection in `skill-to-workflow`: a quality/accuracy score drives the decision to stay, escalate, or skip tiers.
|
|
67
62
|
|
|
68
|
-
|
|
63
|
+
## Table Handling
|
|
69
64
|
|
|
70
|
-
|
|
71
|
-
Parsing done → check character count → check garble ratio → check section completeness → check key tables
|
|
72
|
-
↓ any check fails
|
|
73
|
-
Escalate to the next parser level → re-parse → check again
|
|
74
|
-
```
|
|
65
|
+
Tables are critical in financial documents (balance sheets, ratio tables, compliance metrics). They deserve special attention:
|
|
75
66
|
|
|
76
|
-
|
|
67
|
+
1. **Detection**: Identify table regions. Look for grid patterns, consistent column spacing, or explicit table markers.
|
|
68
|
+
2. **Extraction**: Extract cell-by-cell content. Preserve the row-column relationship.
|
|
69
|
+
3. **Reconstruction**: Convert to a structured format (markdown table, JSON array of rows, or CSV).
|
|
70
|
+
4. **Validation**: Spot-check that key values in the reconstructed table match what is visible in the document.
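
For the reconstruction step, a minimal sketch that turns extracted rows into a markdown table is shown below. The function name is an assumption, the first row is treated as the header, and ragged rows are rejected because they usually indicate a broken row-column relationship.

```python
def rows_to_markdown(rows: list[list[str]]) -> str:
    """Render a list of rows (first row = header) as a markdown table."""
    if not rows:
        raise ValueError("at least a header row is required")
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        if len(row) != len(header):
            raise ValueError(f"row has {len(row)} cells, expected {len(header)}")
        # Escape pipes so cell content cannot break the table layout.
        lines.append("| " + " | ".join(cell.replace("|", "\\|") for cell in row) + " |")
    return "\n".join(lines)
```

The column-count check doubles as a cheap first validation before any spot-checking of values.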
|
|
77
71
|
|
|
78
|
-
|
|
72
|
+
When the standard parser fails on tables, try the vision model approach: send the table image (cropped from the PDF page) to a vision model and ask it to produce a markdown table.
|
|
79
73
|
|
|
80
|
-
|
|
74
|
+
## Chart Handling
|
|
81
75
|
|
|
82
|
-
|
|
83
|
-
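- **Character density** (~0.3): actual extracted character count / expected character count estimated from the page count. A value far below expectation means much of the content was not extracted.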
|
|
84
|
-
- **Garble ratio** (~0.2): share of common characters versus anomalous sequences. Encoding problems show up here.
|
|
85
|
-
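- **Section completeness** (~0.3): fraction of TOC entries with matching content in the body. Missing sections are a strong signal of parsing failure.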
|
|
86
|
-
- **Table integrity** (~0.2): whether key values (e.g. total assets, net profit, capital adequacy ratio) can be found in the extracted text.
|
|
76
|
+
Charts (bar charts, line charts, pie charts) occasionally contain data needed for verification:
|
|
87
77
|
|
|
88
|
-
|
|
89
|
-
-
|
|
90
|
-
-
|
|
91
|
-
- < 0.4: skip the intermediate levels and go straight to OCR or a vision model.
|
|
78
|
+
- Extract the chart image from the document.
|
|
79
|
+
- Send to a vision model with a prompt: "Extract the data points, labels, and values from this chart. Return as a JSON array."
|
|
80
|
+
- Validate the extracted data against any nearby text or table that might contain the same numbers.
|
|
92
81
|
|
|
93
|
-
|
|
82
|
+
This is expensive. Only do it when a verification rule specifically requires data from a chart and that data is not available in text elsewhere in the document.
|
|
94
83
|
|
|
95
|
-
|
|
84
|
+
## Output Format
|
|
96
85
|
|
|
97
|
-
|
|
86
|
+
Parsed documents should be saved as clean markdown:
|
|
98
87
|
|
|
99
|
-
##
|
|
88
|
+
- Preserve the document's heading hierarchy (# Chapter, ## Section, ### Subsection).
|
|
89
|
+
- Preserve lists, numbered or bulleted.
|
|
90
|
+
- Convert tables to markdown table format.
|
|
91
|
+
- Note page boundaries if relevant (some rules reference specific pages).
|
|
92
|
+
- Strip noise: headers, footers, page numbers, watermarks (unless a rule specifically checks for them).
|
|
100
93
|
|
|
101
|
-
|
|
94
|
+
Save parsed output alongside the original document for reuse across rules.
|
|
102
95
|
|
|
103
|
-
|
|
96
|
+
## Caching
|
|
104
97
|
|
|
105
|
-
|
|
106
|
-
|
|
107
|
-
|
|
108
|
-
|
|
109
|
-
|
|
110
|
-
3. **Reconstruction**: convert to a structured format.
|
|
111
|
-
- Prefer markdown tables (human-readable, LLM-friendly).
|
|
112
|
-
- For complex tables, use a JSON array of rows (easier to process programmatically).
|
|
113
|
-
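- Preserve the original header hierarchy (e.g. "期末余额" (closing balance) split into "本期" (current period) and "上期" (prior period) sub-columns).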
|
|
114
|
-
|
|
115
|
-
4. **Validation**: spot-check that the reconstructed table matches the original document.
|
|
116
|
-
- Pick 3-5 key values and confirm them against the original PDF page.
|
|
117
|
-
- Check that the row and column counts match.
|
|
118
|
-
- Verify that total rows equal the sum of detail rows (financial statements usually have this constraint).
|
|
119
|
-
|
|
120
|
-
### When Table Extraction Fails
|
|
121
|
-
|
|
122
|
-
When Level 1-2 cannot extract a table correctly:
|
|
123
|
-
- Crop an image of the table region from the PDF.
|
|
124
|
-
- Send it to a vision model with a prompt requesting a markdown table.
|
|
125
|
-
- Apply the same validation steps as above to the vision model's output.
|
|
126
|
-
|
|
127
|
-
Do not switch the whole document to Level 4 because one table page failed. Escalate only the problem table pages.
|
|
128
|
-
|
|
129
|
-
## Chart Handling
|
|
130
|
-
|
|
131
|
-
Charts (bar, line, pie, scatter) occasionally contain data needed for verification:
|
|
132
|
-
|
|
133
|
-
- Extract the chart image from the document (crop by page or by region).
|
|
134
|
-
- Send it to a vision model. Example prompt:
|
|
135
|
-
```
|
|
136
|
-
Extract all data points, labels, and values from this chart.
|
|
137
|
-
Return a JSON array in which each element has label and value fields.
|
|
138
|
-
If there are multiple series, label each with its series name.
|
|
139
|
-
```
|
|
140
|
-
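- Cross-validate the extracted data against text or tables elsewhere in the document; chart data can usually also be found in the body text or appended tables.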
|
|
141
|
-
|
|
142
|
-
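This is an expensive operation. Do it only when a verification rule explicitly requires data from a chart and that data cannot be obtained from the text.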
|
|
143
|
-
|
|
144
|
-
## Output Format
|
|
145
|
-
|
|
146
|
-
Parsed documents should be saved as clean markdown files:
|
|
147
|
-
|
|
148
|
-
- **Preserve heading hierarchy**: `# 第一章 总则`, `## 第一节 定义`, `### 一、适用范围`. Mirror the original document's hierarchy one-to-one.
|
|
149
|
-
- **Preserve lists**: ordered and unordered lists keep their original numbering style.
|
|
150
|
-
- **Convert tables**: use markdown table format. For complex tables, keep enough explanatory context.
|
|
151
|
-
- **Mark page boundaries**: insert `<!-- Page X -->` at each page boundary. Some verification rules reference specific pages.
|
|
152
|
-
- **Strip noise**: remove all headers, footers, page numbers, and watermarks (unless a rule specifically checks them).
|
|
153
|
-
- **Keep the original wording**: do not rephrase the source. Parsing is faithful transcription, not translation or summarization.
|
|
154
|
-
|
|
155
|
-
Suggested file naming: append the `.parsed.md` suffix to the original file name and store it in the same directory.
|
|
156
|
-
|
|
157
|
-
## Caching and Reuse
|
|
158
|
-
|
|
159
|
-
Parsing is time-consuming (especially Level 3-4), so results must be cached to avoid repeated work:
|
|
160
|
-
|
|
161
|
-
- Save the parsed markdown file next to the original file for reuse by all rules.
|
|
162
|
-
- Record the parser level: note which parser level was used at the top of the markdown file or in a companion metadata file.
|
|
163
|
-
- Re-parse only when:
|
|
164
|
-
- The original file is replaced or updated.
|
|
165
|
-
- A rule's verification failure is traced back to parse quality and the parser must be upgraded.
|
|
166
|
-
- The cache file is corrupted or missing.
|
|
167
|
-
|
|
168
|
-
Sharing parse results across rules is the key to efficiency. A 300-page annual report may be referenced by 50 rules: parse once, use 50 times.
|
|
98
|
+
Parsing is expensive (especially at the VLM and OCR levels beyond Level 1). Cache parsed output:
|
|
99
|
+
- Store the parsed markdown alongside the original file.
|
|
100
|
+
- Track which parser level produced it.
|
|
101
|
+
- Re-parse only when: the original file changes, a rule requires higher-quality parsing than what is cached, or a verification failure is traced back to a parsing issue.
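
A file-based cache following these rules could be sketched as below. The `.parsed.md` and `.parsed.json` sidecar names, the `cached_parse` function, and the `parse(source, level)` callback signature are assumptions for illustration; a content hash detects changes to the original file, and the stored level detects requests for higher-quality parsing.

```python
import hashlib
import json
from pathlib import Path


def cached_parse(source: Path, level: int, parse) -> str:
    """Return cached markdown unless the file changed or a higher level is needed."""
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    out = source.with_name(source.name + ".parsed.md")     # hypothetical naming
    meta = source.with_name(source.name + ".parsed.json")  # hypothetical naming
    if out.exists() and meta.exists():
        info = json.loads(meta.read_text(encoding="utf-8"))
        if info.get("sha256") == digest and info.get("level", 0) >= level:
            return out.read_text(encoding="utf-8")
    # Cache miss: parse, then store the output next to the original file.
    text = parse(source, level)
    out.write_text(text, encoding="utf-8")
    meta.write_text(json.dumps({"sha256": digest, "level": level}), encoding="utf-8")
    return text
```

A re-parse triggered by a verification failure would simply call this again with a higher `level`.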
|