kc-beta 0.8.1 → 0.8.3

Files changed (63)
  1. package/package.json +1 -1
  2. package/src/agent/context.js +17 -1
  3. package/src/agent/engine.js +85 -8
  4. package/src/agent/llm-client.js +24 -1
  5. package/src/agent/pipelines/_milestone-derive.js +78 -7
  6. package/src/agent/pipelines/skill-authoring.js +19 -2
  7. package/src/agent/tools/release.js +94 -1
  8. package/src/cli/index.js +28 -7
  9. package/template/.env.template +1 -1
  10. package/template/AGENT.md +2 -2
  11. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  12. package/template/skills/en/bootstrap-workspace/SKILL.md +13 -0
  13. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  14. package/template/skills/en/confidence-system/SKILL.md +30 -8
  15. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  16. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  17. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  18. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  19. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  20. package/template/skills/en/document-chunking/SKILL.md +99 -15
  21. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  22. package/template/skills/en/quality-control/SKILL.md +14 -0
  23. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  24. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  25. package/template/skills/en/skill-authoring/SKILL.md +52 -8
  26. package/template/skills/en/skill-creator/SKILL.md +25 -3
  27. package/template/skills/en/skill-to-workflow/SKILL.md +23 -4
  28. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  29. package/template/skills/en/tree-processing/SKILL.md +1 -1
  30. package/template/skills/en/version-control/SKILL.md +15 -0
  31. package/template/skills/en/work-decomposition/SKILL.md +21 -35
  32. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  33. package/template/skills/zh/bootstrap-workspace/SKILL.md +13 -0
  34. package/template/skills/zh/compliance-judgment/SKILL.md +14 -0
  35. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  36. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  37. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  38. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  39. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  40. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  41. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  42. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  43. package/template/skills/zh/document-chunking/SKILL.md +96 -20
  44. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  45. package/template/skills/zh/entity-extraction/SKILL.md +14 -4
  46. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  47. package/template/skills/zh/quality-control/SKILL.md +14 -0
  48. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  49. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  50. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  51. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  52. package/template/skills/zh/skill-authoring/SKILL.md +108 -69
  53. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  54. package/template/skills/zh/skill-creator/SKILL.md +71 -61
  55. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  56. package/template/skills/zh/skill-to-workflow/SKILL.md +24 -5
  57. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  58. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  59. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  60. package/template/skills/zh/tree-processing/SKILL.md +1 -1
  61. package/template/skills/zh/version-control/SKILL.md +15 -0
  62. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  63. package/template/skills/zh/work-decomposition/SKILL.md +21 -33
@@ -4,81 +4,83 @@ tier: meta
4
4
  description: Extract and organize business verification rules from regulation documents into discrete, testable units. Use when processing documents in Rules/ to identify individual verification rules, when decomposing a regulation into atomic checks, or when the developer user adds new regulation files. Covers reading regulation text, identifying rule boundaries, determining granularity, handling cross-references, and producing a rule catalog. Also use when rules are provided in structured formats like xlsx or csv.
5
5
  ---
6
6
 
7
- # Rule Extraction
7
+ # 规则抽取(Rule Extraction)
8
8
 
9
- Rules are the atoms of verification. Each rule you extract will become its own skill folder, its own workflow, and its own production pipeline.
9
+ 规则是核查的最小单元。每条抽出的规则将拥有自己的 skill 目录、自己的 workflow、自己的生产管道。
10
10
 
11
- ## How This Differs from Data Extraction
11
+ ## 与数据抽取(data extraction)的区别
12
12
 
13
- Rule extraction is a **one-off task** at the start of a project. You read regulation documents and decompose them into discrete, testable rules. This is fuzzy, agile work — rules are read by you (a SOTA agent), so the schema can be messy and evolve freely.
13
+ 规则抽取是项目启动时**一次性**的工作:你阅读源文档,把它拆成离散、可测试的规则。这件事是模糊、敏捷的——规则是由你(SOTA agent)来读的,所以 schema 可以乱、可以自由演化。
14
14
 
15
- Data/entity extraction (`entity-extraction`) is the **repeating task** that runs on every document being verified. It must fit a unified, stable schema because it feeds into automated workflows.
15
+ 数据/实体抽取(`entity-extraction`)是**重复性**任务,针对每一份被核查的文档都会跑一遍。它必须符合统一、稳定的 schema,因为下游是自动化 workflow。
16
16
 
17
- Don't conflate the two. Rule extraction happens once; data extraction happens on every document.
17
+ 不要把两者混为一谈。规则抽取只发生一次;数据抽取每一份文档都发生。
18
18
 
19
- ## Rule Structure: Location → Extraction → Judgment
19
+ ## 源文先行(Source-first sequencing)
20
20
 
21
- Every verification rule decomposes into three parts:
21
+ 抽取规则时,**先把源文档读透**。只有当你基于源文本身完成第一遍完整的规则编目后,才打开样本文档。早期偷瞄样本"看看哪些规则重要"的诱惑很大——但这会让你的规则集向"样本恰好触及的部分"偏移,并默默丢掉样本未覆盖的规则。
22
22
 
23
- 1. **Location**: Where in the document to look (which chapter, section, table, or full document).
24
- 2. **Extraction**: What data to pull from that location (a number, a date, a clause, a description).
25
- 3. **Judgment**: How to determine pass/fail (threshold comparison, semantic assessment, cross-field check).
23
+ 领域专家的工作顺序是:先读源文、建立理解,再用样本验证——而不是反过来。KC 相对通用 Agent 的差异化优势在于"长上下文下的系统性准确性";这种优势只有在你以**原文为锚**而不是以**样例为锚**时才能复利。
26
24
 
27
- When extracting a rule, explicitly note all three parts. This determines the downstream pipeline structure:
28
- - Full-document rules need no location step.
29
- - Single-section rules need one location step.
30
- - Cross-section rules (comparing values across chapters) need multiple location steps.
25
+ ## 规则结构:Location → Extraction → Judgment
31
26
 
32
- Classify each rule's scope accordingly — it affects how the verification workflow is structured.
27
+ 每条核查规则都可以拆成三部分:
33
28
 
34
- ## Philosophy
29
+ 1. **Location(定位)**:在文档的哪里去看(哪一章、哪一节、哪张表,或整份文档)。
30
+ 2. **Extraction(抽取)**:从那个位置取什么数据(数字、日期、条款、描述)。
31
+ 3. **Judgment(判断)**:怎么判定通过/失败(阈值比较、语义评估、跨字段核对)。
35
32
 
36
- A well-extracted rule is:
37
- - **Atomic**: it checks one thing. "The borrower's debt-to-income ratio must not exceed 50%" is one rule. "The loan agreement must comply with Regulation X" is not — it is a container for many rules.
38
- - **Testable**: given a document, you can definitively say whether the rule passes or fails (or is not applicable).
39
- - **Self-contained**: the rule's meaning does not require reading ten other rules to understand. Cross-references should be resolved into the rule's description.
40
- - **Scoped**: you know WHERE in the document to look. "Chapter 3, Section 2" or "the risk disclosure section" or "the signature page."
33
+ 抽取规则时显式写明三部分。它们决定了下游 pipeline 的结构:
34
+ - 整篇文档级别的规则不需要 location 步骤。
35
+ - 单段级别的规则需要一步 location。
36
+ - 跨段对比的规则(跨章节比较数值)需要多步 location。
41
37
 
42
- But perfection is the enemy of progress. Extract rules at the granularity that feels right for the regulation and the business scenario. You will iterate. The developer user will tell you if rules are too coarse or too fine.
38
+ 按这个口径给每条规则的 scope 分类——它会影响 verification workflow 的结构。
43
39
 
44
- ## Rule Schema Design Principles
40
+ ## 哲学
45
41
 
46
- Individual rules should be atomic and testable (above). The rule catalog as a whole must also satisfy system-level properties:
42
+ 一条抽取得当的规则应当是:
43
+ - **原子的(Atomic)**:它只核查一件事。"借款人的债务收入比不得超过 50%"是一条规则;"贷款合同必须符合 X 法规"不是——它是一堆规则的容器。
44
+ - **可测试的(Testable)**:给定一份文档,你能明确说出这条规则是 PASS、FAIL,还是 NOT_APPLICABLE。
45
+ - **自包含的(Self-contained)**:理解这条规则不需要再读另外十条规则。交叉引用应当在规则描述里被解析掉。
46
+ - **有范围的(Scoped)**:你知道**在文档的哪里**去看。"第 3 章第 2 节"、"风险揭示段"、"签字页"。
47
47
 
48
- ### Coverage Target
49
- Extracted rules should cover at least 95% of the regulation's checkable requirements. After initial extraction, perform a coverage audit: read the source regulation end-to-end and mark which paragraphs are covered by at least one rule. Uncovered paragraphs are either non-checkable (definitions, context) or gaps to close.
48
+ 但是完美是进步的敌人。在源文与业务场景所适合的颗粒度上抽取规则即可。后面会迭代。开发者用户会告诉你哪些规则太粗、哪些太细。
50
49
 
51
- ### Atomicity Test
52
- One rule = one pass/fail outcome. If a rule can produce two independent pass/fail results, it should be two rules. Ask: "Can this rule partially pass?" If yes, decompose further.
50
+ ## 规则 schema 的设计原则
53
51
 
54
- ### Ambiguity Minimization
55
- No two rules should produce contradictory results on the same document. After extraction, review rule pairs that touch overlapping scope. If Rule A says pass and Rule B says fail for the same entity, their scope boundaries are unclear — fix them.
52
+ 单条规则要原子、要可测(见上一节)。规则集合作为一个整体,还要满足系统级的属性:
56
53
 
57
- ### Downstream Anticipation
58
- Rules will be distilled into workflows (see `skill-to-workflow`). Design with distillation in mind: clear input/output boundaries, explicit judgment criteria, minimal reliance on implicit domain knowledge. If a rule requires reading between the lines, make the interpretation explicit. Use `task-decomposition` to identify natural boundaries between rules.
54
+ ### 覆盖率目标
55
+ 抽出的规则应当覆盖源文中至少 95% 的可核查要求。首轮抽取完成后做一次覆盖审计:把源文从头到尾读一遍,标记哪些段落至少被某一条规则覆盖。未覆盖的段落要么是不可核查(定义、背景),要么就是要补的缺口。
59
56
 
60
- ### Catalog Versioning
61
- When rules change (additions, modifications, deprecations), version the entire rule catalog as a unit. Individual rule versions track specific rules; the catalog version tracks the coherent set. Record the catalog version in `versions.json` alongside individual rule versions.
57
+ ### 原子性测试
58
+ 一条规则 = 一个 PASS/FAIL 结果。如果一条规则能产生两个互相独立的 PASS/FAIL 判定,那它应该是两条规则。问自己:"这条规则有没有可能'部分通过'?"如果有,继续拆。
62
59
 
63
- ## Granularity Calibration (read before extracting)
60
+ ### 歧义最小化
61
+ 两条规则不应在同一份文档上得出互相矛盾的结果。抽完之后回头看看那些 scope 有重叠的规则对:如果同一对象 A 说 PASS、B 说 FAIL,说明它们的 scope 边界不清楚——把边界补清楚。
64
62
 
65
- A well-extracted rule catalog has **10-20 rules per typical regulation PDF**
66
- (2025 banking/insurance disclosure regs, 30-80 pages). Over-extraction into
67
- 60-100 rules per regulation signals you're treating every clause as its own
68
- rule — which downstream consumers (skill-authoring, workflow-run) can't
69
- distinguish meaningful checks from boilerplate.
63
+ ### 下游预设
64
+ 规则会被蒸馏为 workflow(见 `skill-to-workflow`)。设计时就要把蒸馏放在脑子里:清晰的输入/输出边界、显式的判定标准、尽量不依赖隐式领域知识。如果某条规则需要"读出言外之意",就把那层意思显式写出来。用 `task-decomposition` 来识别规则之间自然的边界。
70
65
 
71
- If your first pass produces more than ~25 rules for a single regulation:
72
- - **Merge rules that share evidence and fail together** (e.g., "must disclose
73
- X" and "must disclose Y" where both come from the same required-fields
74
- table → one rule: "must disclose the required-fields list including X, Y").
75
- - **Drop procedural language** that isn't checkable against a report
76
- (definitions, scope statements, references to other regs that just
77
- transitively apply).
78
- - **Keep only checkable obligations, prohibitions, and thresholds** — the
79
- things where you can read a sample report and say pass or fail.
66
+ ### 目录版本化
67
+ 规则发生变化(新增、修改、废止)时,把整个规则目录作为一个整体来版本化。单条规则的版本号跟踪具体规则;目录版本号跟踪整体一致的集合。把目录版本号和各规则版本号一起记到 `versions.json`。
80
68
 
81
- ### Sample "good" rule
69
+ ## 颗粒度校准(开抽前先读这一节)
70
+
71
+ 规则目录的素材千差万别——正式法规、内部手册、判决/裁定汇编、法律意见、专家整理的规则表、监管问答。**不存在一个普适的"每页 N 条规则"的标准**。用逻辑而非数字来校准:
72
+
73
+ - **原子性才是真正的判据**。一条规则如果能产生两个互相独立的 PASS/FAIL 结果,那就是两条规则。一条规则的判定如果需要核对源文里三段不同的内容,那大概率应该是三条规则。
74
+ - **样板文字不是规则**。定义条款、适用范围说明、纯粹引用其他规则的"传递性"条款、无法对目标文档做核查的程序性条款——这些都不应当变成规则。
75
+ - **只保留可核查的义务、禁止与阈值** —— 你能拿着目标文档说出 PASS / FAIL / NOT_APPLICABLE 的那些。
76
+
77
+ 如果你的第一遍抽取感觉太粗(一章一条,忽略了章内多个独立义务)—— 拆细。如果感觉太细(定义章节里每一句都成规则)—— 合并或删。然后:
78
+
79
+ - **把共享证据、一同失败的规则合并**(例如:"必须披露 X"和"必须披露 Y"都来自同一张必填字段表 → 合成一条:"必须披露包括 X、Y 的必填字段列表")。
80
+ - **删掉无法对目标文档进行核查的程序性条款**。
81
+ - **把每条留下来的规则改写成"可证伪陈述"** —— 如果你说不出"什么情况下这条规则会失败",你还没真正抽出一条规则。
82
+
83
+ ### 一条"好规则"的样例
82
84
 
83
85
  ```json
84
86
  {
@@ -93,151 +95,120 @@ If your first pass produces more than ~25 rules for a single regulation:
93
95
  }
94
96
  ```
95
97
 
96
- Note: one pass/fail outcome, a single `source_ref` to a specific clause,
97
- clear applicability scope. Skill-authoring can write `check_r014.py` from
98
- this alone.
99
-
100
- ### Cross-regulation dedup (when working across multiple PDFs)
101
-
102
- If the developer user provides N regulations, rules from later regs often
103
- duplicate cross-cutting requirements already captured by earlier ones
104
- (e.g., 资管新规 2018 generic disclosure rule vs. 信披办法 2025's specific
105
- version). Before emitting a rule from reg-N:
106
-
107
- 1. **Check the existing catalog.** Use `rule_catalog` (operation: list) to
108
- see what's already there. Skip if a rule with equivalent scope + intent
109
- exists.
110
- 2. **Prefer the newer / more specific source_ref** when rules overlap.
111
- 3. **If you merged rules**, record the consolidated sources in `source_ref`:
112
- e.g., `"信披办法 §15.2 + 资管新规 §24"`.
113
-
114
- ### Delegation to sub-agents
115
-
116
- If you dispatch extraction to sub-agents (one per regulation), the sub-agent
117
- inherits ONLY its `task_description`; it cannot see your conversation or
118
- existing catalog. Therefore, when composing the brief:
119
-
120
- - **Specify the target count band** explicitly: "Extract 10-20 atomic
121
- rules from this regulation."
122
- - **Include a sample rule** in the brief body (paste the JSON above
123
- verbatim) so the sub-agent's calibration matches yours.
124
- - **Name every regulation the sub-agent should process.** If AGENT.md
125
- lists 10 core regulations, the brief must list all 10 by name, not
126
- "the core regs" as a pronoun — LLMs composing long structured briefs
127
- frequently drop items (observed in session 6304673afaa0 where reg 02
128
- was silently omitted).
129
- - **State the dedup contract**: "Rules already in the parent's catalog
130
- (R001–Rnnn) should NOT be re-extracted. If a requirement is already
131
- covered, skip it." Then pass the current catalog's ID ranges.
132
- - **Prefer `rule_catalog` create operations over sandbox_exec writes to
133
- catalog.json.** rule_catalog uses workspace file locking (B9);
134
- sandbox_exec bypasses it and races with other writers.
135
-
136
- ## 如何读取规则文件 (默认整本读取)
137
-
138
- 法规文件是审核的权威依据。你为每条规则记录的 `source_ref` 都要能在
139
- 原文中复核。对于绝大多数规则文件 (单个文件 < 50 KB / < ~100 页),
140
- **用 `workspace_file` (operation=read) 一次性整本读取**:
98
+ 注意:一个 PASS/FAIL 结果、一个 `source_ref` 指向具体条款、清晰的适用范围。skill-authoring 仅凭这一条 JSON 就能写出 `check_r014.py`。
99
+
100
+ ### 跨源文档去重(处理多份源文档时)
101
+
102
+ 开发者用户给出 N 份源文档时,后面文档里的规则常常和前面已经抽出的规则在内容上重叠(比如较新版本的具体规定 vs. 较老版本的通用规定)。在为第 N 份源文档生成规则前:
103
+
104
+ 1. **先看现有目录**。用 `rule_catalog`(operation: list)查现状。如果已存在等价 scope + 意图的规则,就跳过。
105
+ 2. **冲突时优先用更新、更具体的 `source_ref`**。
106
+ 3. **如果合并了规则**,在 `source_ref` 里写明合并后的来源,例如 `"信披办法 §15.2 + 资管新规 §24"`。
107
+
108
+ ### 派发给子代理
109
+
110
+ 如果你把抽取工作派发给子代理(每份源文档一个),子代理只继承它自己的 `task_description` —— 它看不到你的对话也看不到现有目录。所以在写 brief 时:
111
+
112
+ - **用一条具体的样例规则锚定校准**。把上面的 JSON 原样贴进 brief 正文,让子代理在原子性上的判断和你对齐。
113
+ - **逐字列出子代理需要处理的每一份源文档**。如果 AGENT.md 里列了 10 份核心源文档,brief 也要把这 10 份逐一列出来,不要用"那几份核心法规"这种代词 —— LLM 在写较长的结构化 brief 时频繁会把列表元素静悄悄漏掉。
114
+ - **明确去重契约**:"父级目录中已有的规则(R001–Rnnn)**不要**重抽。如果某条要求已被覆盖,跳过。"然后把当前目录的 ID 范围传过去。
115
+ - **优先用 `rule_catalog` 的 create 操作,而不是 `sandbox_exec` 直接写 catalog.json**。`rule_catalog` 走工作区文件锁;`sandbox_exec` 绕过它,会和其他写入方抢锁。
116
+
117
+ ## 如何读取源文档(默认整本读取)
118
+
119
+ 源文档是规则目录的权威依据。你为每条规则记录的 `source_ref` 都要能在原文中复核。对于绝大多数源文档(单个文件 < 50 KB / < ~100 页),**用 `workspace_file`(operation=read)一次性整本读取**:
141
120
 
142
121
  ```js
143
122
  workspace_file({ operation: "read", scope: "project", path: "Rules/01_某某办法.md" })
144
123
  ```
145
124
 
146
- `workspace_file.read` 单次上限 50,000 字符, 足以覆盖几乎所有单个法规
147
- 文件。这是默认行为: **在抽取规则之前, 把每一份法规文件都整本读一遍。**
125
+ `workspace_file.read` 单次上限 50,000 字符,足以覆盖几乎所有单个源文档。这是默认行为:**在抽取规则之前,把每一份源文档都整本读一遍。**
148
126
 
149
127
  ### 工具选择 — `workspace_file` 还是 `sandbox_exec`
150
128
 
151
129
  | 工具 | 单次上限 | 适用 |
152
130
  |---|---:|---|
153
- | `workspace_file` (read) | 50,000 字符 | **整本读取法规/规则文件** |
154
- | `sandbox_exec` (cat/head) | 10,000 字符 | 短命令, 不适合整文件读取 |
131
+ | `workspace_file` (read) | 50,000 字符 | **整本读取源/规则文件** |
132
+ | `sandbox_exec` (cat/head) | 10,000 字符 | 短命令,不适合整文件读取 |
133
+
134
+ `sandbox_exec` 是为执行 shell 命令设计的,10K 上限对绝大多数源文档太小。`cat rules/01_*.md` 只会返回前 ~10 KB,后面被截断为 `\n[truncated]`。反复用 `head -N` / `tail -M` 滑动窗口会丢失行号位置信息,也浪费交互回合。**遇到截断,别和上限较劲 —— 换工具。**
155
135
 
156
- `sandbox_exec` 是为执行 shell 命令设计的, 10K 上限对绝大多数法规太小。
157
- `cat rules/01_*.md` 只会返回前 ~10 KB, 后面被截断为 `\n[truncated]`。
158
- 反复用 `head -N` / `tail -M` 滑动窗口会丢失行号位置信息, 也浪费交互
159
- 回合。**遇到截断, 别和上限较劲——换工具。**
136
+ ### 源文档与样本的不对称 —— 源文整本读,样本按需抽样
160
137
 
161
- ### 法规与样本的不对称 — 法规整本读, 样本按需抽样
138
+ 源文档通常只有 1–10 份,权威性强,只需读一次。每一份都整本读取,作为后续所有规则抽取与引用的基础。
162
139
 
163
- 法规通常只有 1–10 份, 权威性强, 只需读一次。每一份法规都整本读取,
164
- 作为后续所有规则抽取与引用的基础。
140
+ 样本文档可能 30 份甚至 1000+ 份,异质性强,在测试阶段会被多次读取。**不要试图把每个样本都整本读一遍** —— 用规则适用性过滤、抽样子集来聚焦注意力。
165
141
 
166
- 样本文档可能 30 份甚至 1000+ 份, 异质性强, 在测试阶段会被多次读取。
167
- **不要试图把每个样本都整本读一遍**——用规则适用性过滤、抽样子集来
168
- 聚焦注意力。
142
+ ### 例外 —— 单个源文档超过 200K 字符时
169
143
 
170
- ### 例外:单个法规超过 200K 字符时
144
+ 实践中极少见。绝大多数法规、手册或规则表类源文档都能舒服地放进 50 KB 之内。但如果你确实遇到一份超大源文档,读取整本会挤压上下文窗口(启发式:单文件超过 ~200,000 字符 或超过你上下文预算的 ~25%),此时由你判断:
171
145
 
172
- 实践中极少见。test_data_4 中最大的法规 42 KB; 银行业 资管新规 +
173
- 信披办法 等典型法规也都在 50 KB 以内。但如果你确实遇到一份超大法规,
174
- 读取整本会挤压上下文窗口 (启发式: 单文件超过 ~200,000 字符 或超过你
175
- 上下文预算的 ~25%), 此时由你判断:
146
+ - 按章(`第X章`)分段读,用 `document_parse` 或分页的 `workspace_file`
147
+ - 或建立工作区内的索引文件,标注每章的偏移位置,抽取规则时按需读取
176
148
 
177
- - 按章 (`第X章`) 分段读, 用 `document_parse` 或分页的 `workspace_file`
178
- - 或建立工作区内的索引文件, 标注每章的偏移位置, 抽取规则时按需读取
149
+ 50 KB 的上限已经足够高,上述例外情形几乎不会触发。**默认就是整本读;只有当文件确实太大时才偏离这一默认。**
179
150
 
180
- 50 KB 的上限已经足够高, 上述例外情形几乎不会触发。**默认就是整本读;
181
- 只有当文件确实太大时才偏离这一默认。**
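A minimal sketch of the read-size heuristic described in this section, in JavaScript. The helper name `chooseReadPlan`, its return labels, and the `contextBudgetChars` parameter are hypothetical and are not part of the package. Only the 50,000-character per-call limit and the ~200,000-character / ~25% thresholds come from the text; paging a mid-sized file across several reads, and measuring the context budget in characters, are assumptions.

```js
// Thresholds quoted in the section above.
const SINGLE_READ_LIMIT = 50000;    // workspace_file.read per-call cap, in characters
const OVERSIZE_CHARS = 200000;      // heuristic size at which whole-document reading stops
const OVERSIZE_BUDGET_SHARE = 0.25; // or more than ~25% of the context budget

// Hypothetical helper: pick a reading strategy for one source document.
function chooseReadPlan(charCount, contextBudgetChars) {
  const oversize =
    charCount > OVERSIZE_CHARS ||
    charCount > contextBudgetChars * OVERSIZE_BUDGET_SHARE;
  if (oversize) {
    // Exception path: read chapter by chapter, or via an offset index file.
    return { strategy: "by_chapter", calls: null };
  }
  // Default path: read the whole document, paging only when one call cannot hold it.
  const calls = Math.max(1, Math.ceil(charCount / SINGLE_READ_LIMIT));
  return { strategy: "whole_document", calls };
}
```

Under these assumptions, a 42 KB regulation is one whole-document read, while a 250,000-character file falls onto the chapter-by-chapter path.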
151
+ ## 抽取策略
182
152
 
183
- ## Extraction Strategies
153
+ ### 策略 1:结构化输入(开发者用户提供规则)
184
154
 
185
- ### Strategy 1: Structured Input (Developer User Provides Rules)
155
+ 当开发者用户以 xlsx、csv 或某种结构化文档提供规则、每一行/条目都是一条边界清晰的独立规则时:
156
+ - 严格按对方的结构来,不要再拆。
157
+ - 把每一行映射成一条规则,保留开发者用户的标识符。
158
+ - 只在条目本身有歧义时才回问澄清。
186
159
 
187
- When the developer user provides rules in xlsx, csv, or a structured document where each row/entry is a distinct rule with clear scope:
188
- - Follow their structure exactly. Do not re-decompose.
189
- - Map each row to a rule, preserving the developer user's identifiers.
190
- - Ask clarifying questions only if entries are ambiguous.
160
+ ### 策略 2:从源文中按层级抽取
191
161
 
192
- ### Strategy 2: Hierarchical Extraction from Regulation Text
162
+ 针对原始源文档(PDF、DOCX、法律文本、内部手册等):
193
163
 
194
- For raw regulation documents (PDF, DOCX, legal text):
164
+ 1. **先通览文档结构**。读目录或扫一遍标题,理解层级:编、章、节、条、款。
195
165
 
196
- 1. **Survey the document structure.** Read the table of contents or scan headers. Understand the hierarchy: parts, chapters, sections, articles, clauses.
197
- 2. **Identify rule-bearing sections.** Not every section contains a verification rule. Some are definitions, some are procedural, some are context. Focus on sections that impose obligations, prohibitions, thresholds, or requirements.
198
- 3. **Peel the onion.** Start at the highest structural level and work downward:
199
- - Level 1: What major areas does the regulation cover? (e.g., capital adequacy, risk disclosure, governance)
200
- - Level 2: Within each area, what are the specific chapters or sections?
201
- - Level 3: Within each section, what are the individual requirements?
202
- - Stop peeling when you reach atomic rules.
203
- 4. **Handle cross-references.** Regulations love to say "as defined in Section X" or "subject to the conditions in Article Y." Resolve these by including the referenced content in the rule's description, not just the reference.
204
- 5. **Handle compound rules.** "The report must include (a) risk factors, (b) financial projections, and (c) management discussion" — this is three rules, not one. Decompose unless the developer user specifically wants them grouped.
166
+ 在抽取任何一条规则之前,**先把目录与章节标题从头到尾走一遍**。勾勒出"承载规则的层级结构":哪些章节施加义务,哪些是定义/背景。一种常见失败模式:一份很长、条款很多的源文最终只产出非常少的规则——几乎可以肯定你是在"高密度章节"读完之后就停止了通览。先把你的"含规则章节跨度"显式定下来,然后用这个跨度作为参照去解释偏离,而不是用一个全局的目标数字去校准。
205
167
 
206
- For long documents (100+ pages), use the onion-peeler approach described in `references/chunking-strategies.md`. Do not try to read the entire document in one pass.
168
+ 2. **识别承载规则的章节**。并非每个章节都包含核查规则。有些是定义、有些是程序性、有些是背景。聚焦在那些设定义务、禁止、阈值或要求的章节。
169
+ 3. **像剥洋葱一样自顶向下**:
170
+ - Level 1:源文覆盖哪些大主题?(例如资本充足、风险披露、治理)
171
+ - Level 2:每个主题下面具体是哪些章节?
172
+ - Level 3:每个章节里又是哪些独立要求?
173
+ - 一直剥到原子规则为止。
174
+ 4. **处理交叉引用**。源文喜欢说"按第 X 节的定义"或"在第 Y 条的条件下"。把被引用的内容解析进规则的描述里,不要只留一个引用。
175
+ 5. **处理复合规则**。"报告必须包括(a)风险因素、(b)财务预测、(c)管理层讨论"——这是三条规则,不是一条。除非开发者用户明确要求合并,否则拆开。
207
176
 
208
- ### Strategy 3: Expert Notes
177
+ 对很长的文档,用"剥洋葱"做法 —— 完整策略和针对无清晰标题段落的"楔入法"兜底见 `document-chunking` skill。不要试图一遍读完整篇。
209
178
 
210
- Sometimes rules come from the developer user's domain expertise rather than formal regulations:
211
- - "We always check that the guarantor's signature matches the name on page 1."
212
- - "If the collateral value is below 120% of the loan amount, flag it."
179
+ ### 策略 3:专家笔记
213
180
 
214
- Capture these with the same rigor as formal regulation rules. They are equally important in the verification app.
181
+ 有时候规则不来自正式源文,而来自开发者用户的领域经验:
182
+ - "我们一直会核对担保人签字与第 1 页姓名是否一致。"
183
+ - "如果抵押物估值低于贷款金额的 120%,就要 flag。"
215
184
 
216
- ## Rule Catalog
185
+ 用同样的严谨度去捕获这些规则。在 verification app 里,它们与正式源文规则同等重要。
217
186
 
218
- Maintain a lightweight catalog of all extracted rules. This is your index, not the rules themselves (those live in skill folders). The catalog should track:
187
+ ## 规则目录(Rule Catalog)
219
188
 
220
- - Rule ID (simple sequential: R001, R002, ...)
221
- - Rule title (one line)
222
- - Source (which regulation document, which section)
223
- - Status (extracted / skill-written / skill-tested / workflow-written / workflow-tested / production)
224
- - Dependencies (rules that must be checked before this one)
189
+ 为所有抽出的规则维护一份轻量目录。这是索引,不是规则本身(规则本身住在 skill 目录里)。目录应当跟踪:
225
190
 
226
- Format: a simple markdown table or JSON file. Do not over-engineer this. The catalog exists to give you and the developer user an overview of progress.
191
+ - 规则 ID(顺序递增:R001、R002、…)
192
+ - 规则标题(一行)
193
+ - 来源(哪份源文档、哪一节)
194
+ - 状态(extracted / skill-written / skill-tested / workflow-written / workflow-tested / production)
195
+ - 依赖(必须先核查的前置规则)
227
196
 
228
- ## Project Glossary
197
+ 格式:一张简单的 markdown 表或一份 JSON 文件即可。不要过度工程。目录的作用是让你和开发者用户对进度有一个总览。
229
198
 
230
- Alongside the rule catalog, build a project glossary — a living vocabulary of the entities, terms, and patterns the verification system encounters. The glossary is what keeps entity names consistent across rules: without it, the same balance-sheet item might be named "注册资本", "registered capital", and "paid-in capital" by three different rule skills, breaking shared-entity matching and producing inconsistent extraction outputs.
199
+ ## 项目术语表(Glossary)
231
200
 
232
- The glossary is not frozen at the end of extraction. It is a living document. Update it when you discover new aliases in samples, when a worker LLM extraction reveals a variant phrasing, when corner cases surface unfamiliar terminology. Both the coding agent and any operator can edit it.
201
+ 在规则目录之外,建一份项目术语表 —— 一份活的词汇表,记录核查系统会遇到的实体、术语、模式。术语表的作用是在规则间保持实体名称一致:没有它,同一个资产负债表项目可能被三个不同的 rule skill 分别叫成"注册资本"、"registered capital"、"实收资本",shared-entity 匹配会因此断裂,抽取输出也会不一致。
233
202
 
234
- ### When to seed it
203
+ 术语表在抽取阶段结束后并不冻结。它是活文档。当你在样本中发现新的别名、当一次 worker LLM 抽取暴露出一种变体表述、当边缘案例引出陌生术语时,都该更新它。coding agent 和运营人员都可以编辑。
235
204
 
236
- During rule extraction. As you decompose each rule, note the entities the rule references — capital ratios, signature pages, related-party transactions, dates, parties, monetary values. Seed the glossary with the canonical name and any aliases already visible in the source documents.
205
+ ### 何时开始填
237
206
 
238
- ### Storage and shape
207
+ 在规则抽取期间。当你拆解每条规则时,把它引用的实体记下来——资本比率、签字页、关联交易、日期、当事人、金额。先把规范名(canonical)和源文中已经可见的别名播种进去。
239
208
 
240
- Save as `rules/glossary.json` next to `catalog.json`. Each entry is small:
209
+ ### 存储和形状
210
+
211
+ 存为 `rules/glossary.json`,与 `catalog.json` 并列。每条条目很小:
241
212
 
242
213
  ```json
243
214
  {
@@ -250,48 +221,88 @@ Save as `rules/glossary.json` next to `catalog.json`. Each entry is small:
250
221
  }
251
222
  ```
252
223
 
253
- Status field tracks maturity: `extracted` (from rules), `validated` (confirmed in samples), `production` (used by deployed workflows). Add or drop fields as the project demands — same JIT philosophy as the rule schema.
224
+ `status` 字段跟踪成熟度:`extracted`(来自规则)、`validated`(在样本中已确认)、`production`(已被部署的 workflow 使用)。需要时增删字段——和规则 schema 一样是 JIT 哲学。
225
+
226
+ ### 与下游怎么衔接
227
+
228
+ - `rule-graph` 用 glossary,让 `shares_entity` 边引用规范标签而不是自由文本。
229
+ - `entity-extraction` 在设计抽取逻辑时参考 glossary 拿到规范名和已知别名。
230
+ - `skill-authoring` 写出的 skill 在自己的 schema 中使用规范名。
231
+
232
+ 下游怎么用 glossary 是每个项目自行判断的事。一份成熟的 glossary 可能让某些实体的廉价模式匹配成为可能;另一些情况下它只是保证命名一致。由 `entity-extraction` 的成本-精度逻辑按 case 决定。
233
+
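The alias-to-canonical lookup that these downstream consumers rely on could be sketched as follows. The entry fields `canonical` and `aliases` are hypothetical, since the actual shape of `rules/glossary.json` entries is not visible in this diff; the sample terms mirror the ones used in the glossary section.

```js
// Hypothetical glossary entries; the real field names in rules/glossary.json may differ.
const glossary = [
  {
    canonical: "registered_capital",
    aliases: ["注册资本", "registered capital", "实收资本"],
    status: "extracted",
  },
  { canonical: "signature_page", aliases: ["签字页"], status: "validated" },
];

// Build a lookup from every known surface form to its canonical name.
function buildAliasIndex(entries) {
  const index = new Map();
  for (const entry of entries) {
    for (const form of [entry.canonical, ...entry.aliases]) {
      index.set(form.trim().toLowerCase(), entry.canonical);
    }
  }
  return index;
}

// Return the canonical name, or null when the term is not in the glossary yet.
function canonicalize(index, term) {
  return index.get(term.trim().toLowerCase()) ?? null;
}

const aliasIndex = buildAliasIndex(glossary);
```

A `null` result is a natural trigger for adding a new alias, which fits the "living document" intent of the glossary.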
234
+ ## 处理歧义
235
+
236
+ 源文经常是模糊的。遇到歧义时:
237
+ 1. 按你的理解先把规则抽出来。
238
+ 2. 在规则描述里**显式**标注歧义点。
239
+ 3. 向开发者用户回问澄清。
240
+ 4. 拿到澄清后更新规则。
241
+
242
+ 不要跳过模糊规则。它们往往是最重要的那些。
243
+
244
+ ## 用样本语料做适用性的健康检查
245
+
246
+ > 这是一次**验证扫描**,不是**发现扫描**。不要看到"0 样本规则"就急着删 —— 先回头问:源文是否要求这条规则?若是,把它标为"future scope"保留,而不是丢弃。
247
+
248
+ 抽完规则目录、写 skill 之前,做这个 5 分钟检查:把每条规则的适用性过滤投影到样本语料上。
249
+
250
+ 对每条规则:
251
+ 1. 走一遍 `samples/`,把每一份按产品类型 / 报告类型 / 文档格式分类。
252
+ 2. 对每条规则,按它的 `applicability` 字段、scope 过滤或目录里对应的形状,数它可能适用于多少份样本。
253
+ 3. 标记适用样本数为 **0** 的规则 —— 它们要么是真正"在本测试语料中不相关"(可接受),要么是 scope 过紧(bug)。
254
+
255
+ 一种值得警惕的失败模式:目录里相当大比例(比如 30-40%)的规则在整批样本上都返回 `PASS=0 FAIL=0 NOT_APPLICABLE=all`。其中部分确实合法(源文要求核查的产品类型恰好不在本批样本中),但只要这个比例偏高,几乎总意味着 scope-too-narrow drift —— 适用性过滤过度具体化了。
256
+
257
+ 如果很多规则都是 0 样本,要么:
258
+ - **改写它们的适用性** —— 放宽产品类型、不仅在正文也在页眉页脚里找证据、松一下 scope 过滤
259
+ - **把它们记作"future scope"** 并从本轮目录中移走(仍然写到 `rules/future_scope.md` 里,避免遗忘)
260
+ - **更新测试语料** 加入匹配的样本(和开发者用户协作)
261
+
262
+ 在 `rule_extraction` 阶段就抓出来,比写 N 条 skill 再到 `skill_testing` 阶段发现它们都不触发要便宜得多。这里花的廉价投影时间,后面能省很多。
263
+
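The projection step above can be sketched in a few lines of JavaScript. The rule and sample shapes here (`applicability.productTypes`, `productType`) are hypothetical stand-ins for whatever applicability filter the catalog actually uses, and treating an empty list as "applies to every sample" is an assumption.

```js
// Hypothetical shapes; the real catalog's applicability field may look different.
const rules = [
  { id: "R001", applicability: { productTypes: ["fixed_income"] } },
  { id: "R002", applicability: { productTypes: ["cash_management"] } },
  { id: "R003", applicability: { productTypes: [] } }, // empty list = applies to every sample
];
const samples = [
  { file: "samples/a.pdf", productType: "fixed_income" },
  { file: "samples/b.pdf", productType: "equity" },
];

function appliesTo(rule, sample) {
  const types = rule.applicability.productTypes;
  return types.length === 0 || types.includes(sample.productType);
}

// Count applicable samples per rule and list the rules that never fire.
function projectApplicability(ruleList, sampleList) {
  const counts = {};
  for (const rule of ruleList) {
    counts[rule.id] = sampleList.filter((s) => appliesTo(rule, s)).length;
  }
  const zeroSample = ruleList.filter((r) => counts[r.id] === 0).map((r) => r.id);
  const zeroShare = ruleList.length === 0 ? 0 : zeroSample.length / ruleList.length;
  return { counts, zeroSample, zeroShare };
}

const report = projectApplicability(rules, samples);
```

Here `report.zeroSample` lists the rules to re-scope, move to `rules/future_scope.md`, or cover with new samples, and `report.zeroShare` is the ratio to watch for scope-too-narrow drift.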
264
+ ## 判断类型分类(覆盖率诊断)
265
+
266
+ 第一遍抽取完成后,按判断类型为每条规则打标签:
267
+
268
+ - **Threshold(阈值型)** —— 数值比较("年化利率 ≥ 15.4%")
269
+ - **Decision-Tree(决策树型)** —— 多分支("若产品类型 ∈ {A, B} 则…")
270
+ - **Heuristic(语义判断型)** —— 语义评估("营销话术是否暗示保本")
271
+ - **Process(过程合规型)** —— 流程合规("是否在规定时限内发布")
272
+
273
+ 如果你的规则集中 90% 都是 Threshold 型,那么"无法归约为数字"的语义/过程类义务很可能被漏掉了。**回头重新通览那些被你略过的章节**。在大多数规则语料中这四种类型的占比通常相对均衡;一旦严重偏斜,就是"漏读章节"的信号。
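As an illustration of this diagnostic, the sketch below tallies the four judgment types and flags a skewed distribution. The `judgmentType` field name and its lowercase labels are hypothetical; the 0.9 cut-off mirrors the 90% figure in the text and is not a setting of the package.

```js
const JUDGMENT_TYPES = ["threshold", "decision_tree", "heuristic", "process"];

// Share of each judgment type across the rule list.
function judgmentTypeShares(ruleList) {
  const counts = Object.fromEntries(JUDGMENT_TYPES.map((t) => [t, 0]));
  for (const rule of ruleList) {
    if (!JUDGMENT_TYPES.includes(rule.judgmentType)) {
      throw new Error(`Unknown judgment type on ${rule.id}: ${rule.judgmentType}`);
    }
    counts[rule.judgmentType] += 1;
  }
  const total = ruleList.length;
  const shares = {};
  for (const t of JUDGMENT_TYPES) {
    shares[t] = total === 0 ? 0 : counts[t] / total;
  }
  return shares;
}

// Flag any type whose share reaches the cut-off.
function skewedTypes(shares, cutoff = 0.9) {
  return Object.keys(shares).filter((t) => shares[t] >= cutoff);
}
```

A non-empty result from `skewedTypes` is only a prompt to re-survey the skipped chapters; it does not by itself prove that rules are missing.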
254
274
 
255
- ### How it integrates
275
+ ## 保留细节(拒绝平滑化)
256
276
 
257
- - `rule-graph` consumes the glossary so `shares_entity` edges reference canonical labels rather than free-text strings.
258
- - `entity-extraction` references the glossary for canonical names and known aliases when designing extraction logic.
259
- - Skills authored under `skill-authoring` should use canonical names in their schemas.
277
+ 在撰写规则的 `description` / `falsifiability_statement` 时,**原文中出现的每一个阈值、百分比、时限、命名实体都必须保留**。"在合理期限内披露"是一条模糊规则,下游写不出 check.py 一定会失败——原文几乎肯定写的是"15 个工作日内披露"。如果原文确实模糊(例如只写"及时"而无数字时限),就在规则中**显式标注模糊性**(例如 `notes: "原文使用'及时',无数值时限"`),而不是用自己的判断去平滑掉。下游 skill-authoring 需要这些细节才能写出可执行的检查逻辑。
260
278
 
261
- How the glossary is used downstream is a per-project judgment. A mature glossary may enable cheap pattern-based matching for some entities; for others it just keeps naming consistent. Let the cost-accuracy logic in `entity-extraction` decide per case.
279
+ ## 样本访问的"软纪律"
262
280
 
263
- ## Handling Ambiguity
281
+ KC 不会硬性限制你访问样本的次数——工具调用是开放的。这里的纪律是**流程性纪律**:先走完"源文抽取"阶段,再进入"样本验证"阶段。在源文抽取阶段中,样本只是"术语澄清的兜底参考",而不是"发现规则的入口"。如果你发现自己正在打开第 3 份样本来决定"下一条要抽什么规则"——你已经把方法论倒过来用了,**关掉样本,回到源文**。可以接受的窄例外:
282
+ - 源文中的术语需要靠样本中的实例消歧
283
+ - 验证一条规则的 `description` 字段套用到真实文档时表述是否通顺
264
284
 
265
- Regulations are often ambiguous. When you encounter ambiguity:
266
- 1. Extract the rule as you understand it.
267
- 2. Note the ambiguity explicitly in the rule description.
268
- 3. Ask the developer user for clarification.
269
- 4. Update the rule after receiving clarification.
285
+ ## 主要 vs 辅助源文 —— 决定迭代顺序,不决定覆盖广度
270
286
 
271
- Do not skip ambiguous rules. They are often the most important ones.
287
+ 当开发者用户把一些源文标为"主要"、另一些标为"辅助"(或"补充"、"次要")时,这个区分讲的是**迭代顺序**:主要法规先做深一轮,再回过头处理辅助法规。它**不是**跳过辅助法规的许可证。
272
288
 
273
- ## Sanity-check applicability against the sample corpus
289
+ 一个反复出现的失败模式值得提醒:agent 读到"01-02 是主要依据、其他作为辅助"之后,从 01-02 抽出 13 条规则,从 03-04 抽出 2 条,从 05-10 抽出 0 条。在合规领域,辅助法规通常每部有 60-90 条,几乎一定会承载主要法规引用或预设的核心义务。从它们那里抽零条规则,结果就是一份漏掉真实合规要求的薄规则目录。
274
290
 
275
- After extracting your rule catalog and before authoring skills, do this 5-minute check: project each rule's applicability filter against the sample corpus.
291
+ 正确的理解:主要法规先做第一轮深抽,辅助法规至少做一轮结构性扫描 —— 识别其中的核心义务并抽出来,不必和主要法规等密度。完全跳过一部 80 条规模的法规,必须在 `coverage_audit.md` 里给出明确理由(例如"05 号办法覆盖基金运作,超出本 case 范围;已与用户确认明确排除")。**静默跳过就是失败模式。**
276
292
 
277
- For every rule:
278
- 1. Walk `samples/`, classify each by product type / report type / document format
279
- 2. For each rule, count how many samples it would apply to (per the rule's `applicability` field, scope filter, or whatever shape your catalog uses)
280
- 3. Flag rules that apply to **0 samples** — they're either genuinely test-corpus-irrelevant (acceptable) or over-constrained (bug)
293
+ ## 覆盖率追踪表(推荐交付物)
281
294
 
282
- E2E #7 GLM produced a 97-rule catalog where 36 rules (37%) had `PASS=0 FAIL=0 NOT_APPLICABLE=90` across all 90 documents — they never fired. Some were legit (rules for cash-management products with no cash-management samples in corpus), but 36 inactive of 97 was high enough to suggest scope-too-narrow drift.
295
+ 抽取完成后,**逐段走一遍源文档**,为每一段打标:
283
296
 
284
- If many rules are 0-sample, either:
285
- - **Reframe their applicability**: broaden product types, look for evidence in headers/footers not just body, relax the scope filter
286
- - **Document them as "future scope"** and remove from this iteration's catalog (still capture them in a `rules/future_scope.md` so they're not forgotten)
287
- - **Update the test corpus** to include matching samples (work with the developer user)
297
+ - `covered_by: [Rxxx, Ryyy]`: the rules that cover this clause's obligation
298
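+ - `non_checkable: definition | context | cross_ref | scope`: the clause is explicitly excluded because it is a definition, background, a pure cross-reference, or a statement of applicable scope, with the reason recorded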
288
299
 
289
- Catching this in `rule_extraction` is much cheaper than authoring 36 skills that then test as inactive in `skill_testing`. The cheap projection here is worth the time it saves later.
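The projection described in the removed section above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: it assumes each rule's applicability is a set of product types and each sample has already been classified into one product type. The function name and both data shapes are hypothetical.

```python
from collections import Counter


def zero_sample_rules(rules, samples):
    """Return IDs of rules whose applicability filter matches no sample.

    `rules` maps rule ID -> set of product types the rule applies to.
    `samples` maps sample file name -> its classified product type.
    Both shapes are assumptions made for this sketch.
    """
    type_counts = Counter(samples.values())
    inactive = []
    for rule_id, product_types in rules.items():
        # Counter returns 0 for product types that never occur in the corpus.
        hits = sum(type_counts[t] for t in product_types)
        if hits == 0:
            inactive.append(rule_id)
    return sorted(inactive)


rules = {
    "R001": {"fixed_income"},
    "R002": {"cash_management"},
    "R003": {"fixed_income", "equity"},
}
samples = {"a.pdf": "fixed_income", "b.pdf": "equity"}
inactive = zero_sample_rules(rules, samples)
print(inactive)
```

Each flagged rule still needs a human decision: reframe it, move it to future scope, or extend the corpus.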
300
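+ Write the result to `rules/coverage_trace.md` (or merge it into a section of `coverage_audit.md`). This table is the **source-side mirror** of the existing "sample-side applicability scan" and directly targets the "long source, suspiciously few rules" failure mode. The engine can later read this table to verify completeness.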
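A completeness check over such a trace might look like the sketch below. It assumes the trace has already been parsed into a dict keyed by section ID; the on-disk format of `coverage_trace.md` and the function name are not specified by the source and are hypothetical here.

```python
# Allowed reasons, taken from the tag definition above.
NON_CHECKABLE = {"definition", "context", "cross_ref", "scope"}


def untraced_sections(trace):
    """Return section IDs with neither a non-empty covered_by nor a valid non_checkable tag."""
    missing = []
    for section_id, tags in trace.items():
        covered = bool(tags.get("covered_by"))
        excluded = tags.get("non_checkable") in NON_CHECKABLE
        if not (covered or excluded):
            missing.append(section_id)
    return missing


trace = {
    "Art. 1": {"non_checkable": "definition"},
    "Art. 2": {"covered_by": ["R001", "R002"]},
    "Art. 3": {},                    # never tagged
    "Art. 4": {"covered_by": []},    # tagged, but no rule listed
}
gaps = untraced_sections(trace)
print(gaps)
```

Any section returned here was skipped silently and needs either a rule or an explicit exclusion reason.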
290
301
 
291
- ## When Rules Change
302
+ ## When Rules Change
292
303
 
293
- Regulations evolve. When the developer user adds new or updated regulation documents:
294
- 1. Identify which existing rules are affected.
295
- 2. Extract new rules or update existing ones.
296
- 3. Mark affected workflows for re-testing.
297
- 4. Use `version-control` to track the change.
304
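+ Source documents evolve. When the developer user adds new or updated source documents:
305
+ 1. Identify which existing rules are affected.
306
+ 2. Extract new rules or update existing ones.
307
+ 3. Mark affected workflows for re-testing.
308
+ 4. Use `version-control` to track the change.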
@@ -1,80 +1,7 @@
1
- # Chunking Strategies for Long Documents
1
+ # Chunking Strategies
2
2
 
3
- When regulation documents exceed what you can process in a single pass, use these proven strategies to decompose them into manageable chunks while preserving semantic coherence.
3
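+ The chunking methodology (onion peeler + wedge-driving fallback + balancing heuristics) has moved to
4
+ the `document-chunking` skill. When designing chunking for rule extraction or any downstream processing,
5
+ consult that skill directly.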
4
6
 
5
- ## The Onion Peeler (Primary Strategy)
6
-
7
- Hierarchical header-based decomposition. Named because you peel the document layer by layer, from the outermost structure inward.
8
-
9
- ### How It Works
10
-
11
- 1. **Parse the document's header hierarchy.** Identify all headers by level (H1, H2, H3, etc. — or their equivalents in the document's formatting: "Part I", "Chapter 1", "Section 1.1", "Article 1").
12
- 2. **Build a tree.** Each header becomes a node. Content between headers belongs to the nearest preceding header at that level.
13
- 3. **Check sizes.** Walk the tree. If a node's content (including all its children) fits within your processing limit, stop — this node is a chunk.
14
- 4. **Split only when necessary.** If a node exceeds the limit, descend to its children. Only split when a node is too large AND has sub-headers to split on.
15
- 5. **Leaf nodes that are still too large** get handled by the wedge-driving fallback (see below).
16
-
17
- ### Why This Works
18
-
19
- - Respects the document's own semantic structure. A "Chapter 3: Risk Disclosure" chunk contains exactly what the author intended that chapter to contain.
20
- - Minimizes information loss. You never cut in the middle of a thought.
21
- - Produces chunks of varying size — and that is fine. A short chapter is better as one chunk than split into artificial halves.
22
-
23
- ### Pattern Discovery Shortcut
24
-
25
- Before building a full parser, explore several sample documents for structural patterns:
26
- - Do all chapter titles start with "Chapter X" or "第X章"?
27
- - Are sections numbered consistently (1.1, 1.2, 1.3)?
28
- - Are there visual markers (bold text, specific fonts, horizontal rules)?
29
-
30
- If you find consistent patterns, a regex-based splitter is faster and more reliable than LLM-based structure detection. For example:
31
- - `^第[一二三四五六七八九十百]+章` for Chinese chapter headers
32
- - `^Chapter \d+` for English chapter headers
33
- - `^\d+\.\d+` for numbered sections
34
-
35
- Always validate the regex against multiple documents before committing to it.
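A regex-based splitter built from the two chapter patterns above could look like the following sketch. The function name and the sample strings are made up for illustration; only the patterns come from the text.

```python
import re

# The two chapter-header patterns quoted above, combined and anchored per line.
CHAPTER_RE = re.compile(
    r"^(?:第[一二三四五六七八九十百]+章|Chapter \d+)", re.MULTILINE
)


def split_on_chapters(text):
    """Split text at each chapter header; text before the first header is kept as its own piece."""
    starts = [m.start() for m in CHAPTER_RE.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]


sample = "Preface\nChapter 1 Scope\nbody one\nChapter 2 Duties\nbody two\n"
pieces = split_on_chapters(sample)
zh_pieces = split_on_chapters("第一章 总则\n内容\n第二章 义务\n内容\n")
print(len(pieces), len(zh_pieces))
```

Checking that the pieces rejoin to the original text is a cheap way to validate the pattern on each new document before relying on it.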
36
-
37
- ## Wedge Driving (Fallback Strategy)
38
-
39
- For content without clear headers — dense legal text, continuous prose, or leaf nodes from the onion peeler that are still too large.
40
-
41
- ### How It Works
42
-
43
- The algorithm uses a **rolling context window** to process documents of arbitrary length without loading the full text at once.
44
-
45
- **Step 1: Window the content.** Load up to MAX_TOKENS (e.g., 100K tokens — configurable) of the remaining unprocessed text into a window. If the remaining text fits in a single chunk, stop — no further splitting needed.
46
-
47
- **Step 2: Ask an LLM for cut points.** Prompt the LLM to identify 1-3 natural break points within the window where topic or subject changes. For each cut point, the LLM returns:
48
- - `tokens_before`: ~K tokens (default K=50) immediately BEFORE the cut, copied verbatim from the text.
49
- - `tokens_after`: ~K tokens immediately AFTER the cut, copied verbatim.
50
- - `chunk_title`: a 5-10 word title describing the chunk that precedes the cut.
51
-
52
- Using token count (not word count) gives consistent granularity across languages — critical for Chinese text which has no whitespace-delimited words.
53
-
54
- **Step 3: Locate the cuts via fuzzy matching.** The LLM's quoted tokens will not be a perfect match to the source text (minor paraphrasing, whitespace differences, encoding artifacts). Use Levenshtein distance (edit distance) to find the best match:
55
- 1. Search the source text for the position that best matches `tokens_before`. Require at least 70% similarity (similarity = 1 - edit_distance / max_length).
56
- 2. The cut position is immediately after the matched `tokens_before` region.
57
- 3. Verify by checking that `tokens_after` appears near the cut position. If `tokens_after` cannot be matched, fall back to the position derived from `tokens_before` alone.
58
-
59
- **Step 4: Slide and repeat.** Create a chunk from the text before the first confirmed cut. Move the window forward: the new window starts from the last cut point. Repeat until all remaining text fits in a single chunk.
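Step 3 can be illustrated with a small pure-Python sketch. It makes several simplifying assumptions that are not in the source: it compares characters rather than tokens, slides a fixed-width window the same length as the quote, and ignores the `tokens_after` verification. The function names are hypothetical, and the brute-force scan is too slow for real documents.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # deletion
                cur[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def locate_cut(source, quoted_before, min_similarity=0.7):
    """Return the index just after the best match of quoted_before, or None."""
    n = len(quoted_before)
    if n == 0 or n > len(source):
        return None
    best_pos, best_sim = None, 0.0
    for start in range(len(source) - n + 1):
        window = source[start:start + n]
        # similarity = 1 - edit_distance / max_length, as defined in Step 3
        sim = 1 - levenshtein(window, quoted_before) / max(len(window), n)
        if sim > best_sim:
            best_pos, best_sim = start + n, sim
    return best_pos if best_sim >= min_similarity else None


source = "The fund shall disclose fees. Risk warnings must be prominent."
cut = locate_cut(source, "shall disclose fees,")   # the quote has one wrong character
print(source[:cut])
```

The quote differs from the source by one character, so the match still clears the 70% threshold and the cut lands right after the first sentence.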
60
-
61
- ### Why This Works
62
-
63
- - The LLM identifies semantic boundaries, not arbitrary character counts.
64
- - The LLM never regenerates text — it only quotes positions. No hallucination risk.
65
- - K-token quoting with Levenshtein matching is language-agnostic. It works for Chinese, English, and mixed-language documents equally well.
66
- - The rolling window means documents of any length can be processed incrementally — the algorithm is not bounded by context window size.
67
- - Fuzzy matching handles the inevitable small differences between the LLM's quoted text and the actual source.
68
-
69
- ### When to Use
70
-
71
- - Only when the onion peeler cannot split further (no sub-headers available).
72
- - For documents with no structural markup at all.
73
- - Cost consideration: this requires LLM calls. Use the cheapest model that can identify topic boundaries (often TIER3 or TIER4 is sufficient).
74
-
75
- ## Practical Guidelines
76
-
77
- - **Chunk size depends on the downstream task.** For rule extraction by the coding agent, chunks can be large (100K+ tokens). For worker LLM processing, chunks must fit in 16K-32K context.
78
- - **Preserve context.** When splitting, include the parent header chain as context. A chunk from "Part II > Chapter 3 > Section 3.2" should include those headers so downstream processing knows where the content belongs.
79
- - **Cache the tree.** Once a document's structure is parsed, save the tree. Multiple rules may need content from the same document, and re-parsing is wasteful.
80
- - **Log your chunking decisions.** Which strategy was used, how many chunks were produced, their sizes. This helps debug downstream issues.
7
+ This file is kept as a placeholder for legacy references; new content goes in `document-chunking`.