kc-beta 0.1.2 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (55)
  1. package/bin/kc-beta.js +14 -2
  2. package/package.json +1 -1
  3. package/src/agent/context-window.js +151 -0
  4. package/src/agent/context.js +8 -4
  5. package/src/agent/engine.js +261 -8
  6. package/src/agent/event-log.js +111 -0
  7. package/src/agent/llm-client.js +352 -59
  8. package/src/agent/pipelines/base.js +6 -0
  9. package/src/agent/pipelines/distillation.js +18 -0
  10. package/src/agent/pipelines/extraction.js +21 -0
  11. package/src/agent/pipelines/initializer.js +75 -14
  12. package/src/agent/pipelines/production-qc.js +19 -0
  13. package/src/agent/pipelines/skill-authoring.js +14 -0
  14. package/src/agent/pipelines/skill-testing.js +20 -0
  15. package/src/agent/retry.js +83 -0
  16. package/src/agent/session-state.js +79 -0
  17. package/src/agent/skill-loader.js +13 -1
  18. package/src/agent/token-counter.js +62 -0
  19. package/src/agent/tools/document-parse.js +104 -21
  20. package/src/agent/tools/document-search.js +24 -8
  21. package/src/agent/tools/sandbox-exec.js +16 -5
  22. package/src/agent/tools/web-search.js +107 -0
  23. package/src/agent/tools/worker-llm-call.js +14 -5
  24. package/src/agent/tools/workspace-file.js +47 -20
  25. package/src/agent/workspace.js +24 -1
  26. package/src/cli/components.js +24 -5
  27. package/src/cli/config.js +340 -0
  28. package/src/cli/index.js +113 -11
  29. package/src/cli/onboard.js +216 -53
  30. package/src/config.js +63 -10
  31. package/src/model-tiers.json +153 -0
  32. package/src/providers.js +367 -0
  33. package/template/AGENT.md +20 -0
  34. package/template/skills/en/meta/compliance-judgment/SKILL.md +10 -42
  35. package/template/skills/en/meta/document-chunking/SKILL.md +32 -0
  36. package/template/skills/en/meta/document-parsing/SKILL.md +11 -18
  37. package/template/skills/en/meta/entity-extraction/SKILL.md +13 -28
  38. package/template/skills/en/meta/tree-processing/SKILL.md +19 -1
  39. package/template/skills/en/meta-meta/auto-model-selection/SKILL.md +53 -0
  40. package/template/skills/en/meta-meta/pdf-review-dashboard/SKILL.md +57 -0
  41. package/template/skills/en/meta-meta/pdf-review-dashboard/scripts/generate_review.js +262 -0
  42. package/template/skills/en/meta-meta/rule-extraction/SKILL.md +24 -1
  43. package/template/skills/en/meta-meta/skill-authoring/SKILL.md +6 -0
  44. package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +4 -0
  45. package/template/skills/zh/meta/compliance-judgment/SKILL.md +41 -262
  46. package/template/skills/zh/meta/document-chunking/SKILL.md +32 -0
  47. package/template/skills/zh/meta/document-parsing/SKILL.md +65 -132
  48. package/template/skills/zh/meta/entity-extraction/SKILL.md +68 -230
  49. package/template/skills/zh/meta/tree-processing/SKILL.md +82 -194
  50. package/template/skills/zh/meta-meta/auto-model-selection/SKILL.md +51 -0
  51. package/template/skills/zh/meta-meta/pdf-review-dashboard/SKILL.md +55 -0
  52. package/template/skills/zh/meta-meta/pdf-review-dashboard/scripts/generate_review.js +262 -0
  53. package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +79 -164
  54. package/template/skills/zh/meta-meta/skill-authoring/SKILL.md +64 -185
  55. package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +95 -216
@@ -3,273 +3,152 @@ name: skill-to-workflow
3
3
  description: Distill a proven verification skill into a Python workflow with worker LLM prompts. Use when a rule skill has been tested and reaches the SKILL_ACCURACY threshold defined in .env. Covers the decision of what to implement as code vs LLM calls, prompt engineering for small context windows, model tier selection and progressive downgrade, and testing workflows against the coding agent's own results as ground truth. Also use when optimizing existing workflows for cost or speed.
4
4
  ---
5
5
 
6
- # Distilling Skills into Workflows
6
+ # Skill to Workflow
7
7
 
8
- ## The Essence of Distillation
8
+ The skill is the ground truth. The workflow is a cheaper, faster approximation. Your job is to make the approximation as good as the original while being as cheap as possible.
9
9
 
10
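- The skill is the reference answer. You, the coding agent, execute it directly: you call the strongest model, have the full context, and pursue accuracy regardless of cost.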
10
+ ## Engineering Goal
11
11
 
12
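- The workflow is a cheap approximation of the skill. It is driven by Python code and calls smaller, cheaper worker LLMs (execution models), completing verification within a limited context window.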
12
+ Optimize the full chain: **shortest workflow** (fewest nodes) → **smallest model per node** (cheapest tier that meets accuracy) → **shortest prompt per model** (minimum tokens). This is the engineering objective — not prompt template sophistication or framework compliance.
13
13
 
14
- The goal of distillation: **get as close as possible to the skill's accuracy while cutting cost substantially.**
14
+ ## When to Start
15
15
 
16
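- This is not translation. You are not translating SKILL.md into Python. You are redesigning the verification process so that a less capable executor can still get it right.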
16
+ A skill is ready for workflow distillation when:
17
+ - It has been tested on all documents in Samples/.
18
+ - Its accuracy meets or exceeds the SKILL_ACCURACY threshold in `.env`.
19
+ - Edge cases are documented in the skill's `assets/corner_cases.json`.
20
+ - You understand the rule well enough to explain exactly how you verify it.
17
21
 
18
- ## Prerequisites for Starting Distillation
22
+ If any of these are not true, go back and iterate on the skill first.
19
23
 
20
- All of the following conditions must hold before distillation starts:
24
+ ## The Distillation Decision
21
25
 
22
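- 1. **Skill test accuracy meets the bar**: accuracy on `assets/samples.json` and `assets/corner_cases.json` meets the `SKILL_ACCURACY` threshold in `.env`
23
- 2. **Edge cases are well documented**: at least the main known exceptions are covered
24
- 3. **Judgment logic is stable**: the last two iterations made no changes to the core judgment logic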
26
+ For each step in your skill-based verification process, ask:
25
27
 
26
- If the skill itself is still changing frequently, do not rush to distill it. Wait until it stabilizes.
28
+ ### Can this be done with regex or Python? (Cost: zero)
29
+ - Date extraction with known formats → regex
30
+ - Numeric comparison against threshold → Python arithmetic
31
+ - Chinese numeral conversion → Python lookup table
32
+ - Format validation (ID numbers, codes) → regex
33
+ - Table cell extraction from structured markdown → string manipulation
27
34
 
28
- ## The Distillation Decision: Code or Model Call
35
+ If yes, write it as code. These are free, fast, and deterministic.
29
36
 
30
- This is the most critical decision in the distillation process. The principle is simple:
37
+ ### Does this require language understanding? (Cost: worker LLM call)
38
+ - Finding the relevant section in a document → LLM
39
+ - Extracting an entity described in natural language → LLM
40
+ - Judging semantic adequacy ("adequate risk disclosure") → LLM
41
+ - Resolving ambiguous references → LLM
31
42
 
32
- ### Implement in Python code (zero cost)
43
+ If yes, design a worker LLM prompt. Use the smallest model tier that maintains accuracy.
33
44
 
34
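- Date comparison, amount calculation, tax rate conversion
35
- Regex matching (invoice number format, unified social credit code validation)
36
- Field presence checks
37
- Format normalization (letter case, date format conversion)
38
- Enum value validation (currency codes, country codes)
39
- Arithmetic (total including tax = amount excluding tax × (1 + tax rate))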
45
+ ### The hybrid approach (most common)
46
+ Most rules are a mix: regex extracts the number, Python compares it to the threshold, LLM handles the exceptional cases. Design the workflow as a pipeline where cheap steps run first and expensive steps run only when needed.
40
47
 
41
- ### Implement with worker LLM calls (has a cost)
48
+ ## Workflow Structure
42
49
 
43
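- Extracting key information from unstructured text
44
- Understanding the business meaning of natural-language descriptions
45
- Judging whether two passages are semantically consistent
46
- Recognizing and parsing complex table structures
47
- Classification (e.g., which account an expense belongs to)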
48
-
49
- ### Hybrid approach (recommended)
50
-
51
- For most verification rules, the best implementation is a hybrid:
52
-
53
- ```
54
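- 1. Python preprocessing (formatting, extracting structured fields)
55
- 2. LLM call (semantic understanding, unstructured information extraction)
56
- 3. Python postprocessing (logic checks, calculation, output formatting)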
57
- ```
58
-
59
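- Sandwich the LLM call in the middle, and use code to constrain its input scope and output format. This uses the LLM's semantic ability while the code guarantees determinism.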
60
-
61
- ## Workflow File Structure
50
+ A workflow is a Python file (or small set of files) in `workflows/`:
62
51
 
63
52
  ```
64
- workflows/R001-invoice-date-validity/
65
- ├── workflow_v1.py # Main workflow code
66
- ├── prompts/ # Prompt templates for the worker LLM
67
- │ ├── extract_dates.md # Date extraction prompt
68
- │ └── judge_validity.md # Validity judgment prompt
69
- ├── config.json # Configuration (model tiers, parameters)
70
- └── CHANGELOG.md # Change log
53
+ workflows/
54
+ rule_001_capital_adequacy/
55
+ workflow_v1.py # The main workflow script
56
+ prompts/
57
+ extract.txt # Worker LLM prompt for extraction
58
+ judge.txt # Worker LLM prompt for judgment (if needed)
59
+ config.json # Model assignments, thresholds
71
60
  ```
72
61
 
73
- ### Structure requirements for workflow_v1.py
62
+ The workflow file should have a clear entry point:
74
63
 
75
64
  ```python
76
- """
77
- R001 - Invoice date validity verification workflow
78
- Distilled from: rule-skills/R001-invoice-date-validity/
79
- Skill accuracy: 95%
80
- Distillation date: 2025-04-01
81
- """
82
-
83
- import json
84
- import os
85
- from pathlib import Path
86
-
87
- def run_verification(document_data: dict, config: dict) -> dict:
65
+ def verify(document_text: str, config: dict) -> dict:
88
66
  """
89
- Entry-point function for the workflow.
90
-
91
- Args:
92
- document_data: data of the document to verify
93
- config: runtime configuration (model selection, API address, etc.)
94
-
95
67
  Returns:
96
- Standard verification result dictionary
68
+ {
69
+ "rule_id": "R001",
70
+ "result": "pass" | "fail" | "missing" | "error",
71
+ "extracted_value": ...,
72
+ "confidence": 0.0-1.0,
73
+ "comment": "..." (only when fail),
74
+ "model_used": "...",
75
+ "llm_calls": int,
76
+ "llm_tokens": int
77
+ }
97
78
  """
98
- # Step 1: preprocessing (pure code)
99
- # Step 2: LLM extraction (if needed)
100
- # Step 3: logic checks (pure code)
101
- # Step 4: format the output
102
- pass
103
- ```
104
-
105
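- The entry-point function must be `run_verification`, with a fixed signature. This lets quality monitoring and batch processing dispatch workflows uniformly.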
106
-
107
- ### config.json
108
-
109
- ```json
110
- {
111
- "rule_id": "R001",
112
- "rule_name": "Invoice date validity",
113
- "distilled_from": "rule-skills/R001-invoice-date-validity/",
114
- "version": "v1",
115
- "model_tier": "TIER3",
116
- "llm_steps": ["extract_dates"],
117
- "code_steps": ["normalize_format", "compare_dates", "format_output"],
118
- "estimated_cost_per_doc": 0.002,
119
- "api_base_url": "${API_BASE_URL}",
120
- "api_key": "${API_KEY}"
121
- }
122
- ```
123
-
124
- ## Prompt Engineering for the Worker LLM
125
-
126
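- The worker LLM is not you. Its context window is smaller, its reasoning is weaker, and it knows nothing about the business background. Prompts must be designed around these limitations.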
127
-
128
- ### Self-containment principle
129
-
130
- A prompt must not assume the worker LLM knows any background information. All necessary context has to be provided explicitly in the prompt:
131
-
132
- ```markdown
133
- You are a document information extraction assistant. Your task is to extract the invoice date from the invoice text below.
134
-
135
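- Extraction rules:
136
- - Look for the 「开票日期」 (invoice date) or 「Date of Issue」 field
137
- - Always output the date in YYYY-MM-DD format
138
- - If no date is found, output null
139
- - Only extract the date; do not make any judgment
140
-
141
- Invoice text: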
142
- {invoice_text}
143
- ```
144
-
145
- ### Enforce structured output
146
-
147
- The worker LLM's output must be parseable. Explicitly require JSON output in the prompt:
148
-
149
- ```markdown
150
- Output strictly in the following JSON format and nothing else:
151
-
152
- {
153
- "invoice_date": "YYYY-MM-DD or null",
154
- "extraction_confidence": "high / medium / low"
155
- }
156
79
  ```
157
80
 
158
- ### Narrow the context
159
-
160
- Do not hand the whole document to the worker LLM. Pass in only the part it needs to process:
161
-
162
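- - If you only need the invoice date, pass only the text of the invoice header area
162
- - If you need to compare contract information, pass only the relevant clauses of the contract
163
- - The narrower the context, the more accurate the extraction and the lower the cost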
165
-
166
- ### Use the document's language
167
-
168
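- The instruction language of the prompt should match the document's language. When verifying Chinese documents, write the prompt in Chinese. This avoids errors the worker LLM might introduce when switching languages.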
169
-
170
- ### Few-shot strategy
171
-
172
- Provide 1-2 concise input/output examples in the prompt, but not too many:
173
-
174
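- - The worker LLM's context window is limited; too many examples crowd out the main text
174
- - Pick the most typical positive example and one common anomaly
175
- - Keep examples short and show only the key features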
177
-
178
- ## Model Tier Selection and Stepwise Downgrade
81
+ This is a reference, not a rigid contract. Adapt the structure to the specific rule. The important thing is that every workflow produces a result that can be compared against the skill-based ground truth.
179
82
 
180
- ### Selection strategy
83
+ ## Prompt Engineering for Worker LLMs
181
84
 
182
- Start from TIER1 and progressively try lower tiers. Four tiers are defined in `.env`:
85
+ Worker LLMs have smaller context windows (typically 16K-32K tokens). Design prompts that:
183
86
 
184
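- - `TIER1`: strongest; suited to complex semantic understanding and multi-step reasoning
185
- - `TIER2`: mid-range; suited to extraction and judgment that need some reasoning
186
- - `TIER3`: lightweight; suited to structured information extraction
187
- - `TIER4`: cheapest; suited to simple format extraction and classification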
87
+ 1. **Are self-contained.** Include everything the model needs in the prompt. Do not assume the model has context from previous calls.
88
+ 2. **Specify the output format.** "Return a JSON object with fields: value, confidence, reasoning." Structured output reduces parsing errors.
89
+ 3. **Include the narrowed context.** Do not send the entire document. Use the tree-processing pipeline (full document → relevant chapter → relevant section) to narrow the context before calling the worker LLM.
90
+ 4. **Are written in the document's language.** Chinese documents get Chinese prompts. English documents get English prompts. Do not mix languages in a single prompt.
91
+ 5. **Provide examples sparingly.** One or two examples help. Ten examples waste context window and risk overfitting.
188
92
 
189
- ### Downgrade procedure
93
+ ## Model Tier Selection
190
94
 
191
- ```
192
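- 1. Run all test samples with TIER1 to establish the accuracy ceiling
193
- 2. Run the same test samples with TIER2 and compare against the TIER1 results
194
- 3. If TIER2 accuracy is close to TIER1 → go on to try TIER3
195
- 4. If TIER3 is still close → go on to try TIER4
196
- 5. Choose the lowest tier that meets the WORKFLOW_ACCURACY threshold
197
- 6. If TIER1 itself falls short → go back to the skill level and review the prompt design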
198
- ```
199
-
200
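- Note: different steps can use different tiers. For example, date extraction can use TIER4 while semantic judgment uses TIER2. Record the optimal tier per step in config.json.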
201
-
202
- ### Formal downgrade protocol
203
-
204
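- The numbers and procedures below are recommended starting points; the coding agent and the developer user should adjust them freely to fit their situation. What matters is the pattern itself (test → compare → record → re-assess on degradation), not the specific numbers.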
95
+ Start with the highest tier (TIER1) for each step. Measure accuracy. Then try lower tiers:
205
96
 
206
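- **Direction**: top-down. First use TIER1 to establish the accuracy ceiling, then try lower tiers one by one to find the best balance between cost and accuracy.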
97
+ 1. Run the workflow with TIER1 on all Samples/. Record accuracy per step.
98
+ 2. For each step, try TIER2. If accuracy stays above WORKFLOW_ACCURACY, keep TIER2.
99
+ 3. Continue downgrading per step until accuracy drops below threshold.
100
+ 4. Record the optimal tier per step in `config.json`.
207
101
 
208
- **Minimum test volume**: run at least min(10, total_samples) documents at each candidate tier. With too few samples the conclusion is unreliable.
102
+ Different steps within the same workflow can use different model tiers. Extraction might need TIER2 while judgment might work fine with TIER3.
209
103
 
210
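- **Accuracy delta rule**: if the lower tier's accuracy is significantly below the tier above it (suggested threshold: >5 percentage points), stay at the higher tier. For example, if TIER1 reaches 96% and TIER2 only 89%, choose TIER1 for that step.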
104
+ ### Formal Downgrade Protocol
211
105
 
212
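- **Independent per-step assessment**: each LLM call step in the workflow has its model tier assessed independently. Step A might use TIER3 while step B needs TIER1. Record the final result per step in the `llm_steps` setting of config.json.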
106
+ The basic approach above works, but a more rigorous protocol prevents premature tier commitments:
213
107
 
214
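- **Re-assessment on degradation**: when production quality control finds accuracy dropping (e.g., a degradation signal detected by the `quality-control` skill), re-run the downgrade assessment for the affected steps. Model vendor updates and data distribution drift can both invalidate the original choice.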
108
+ **Direction**: Start top-down (TIER1 → TIER4) to establish the accuracy ceiling first. You need to know the best possible accuracy before trading it for cost savings.
215
109
 
216
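- **Model-task recommendation table**: maintain a project-level mapping of task_type → tier to accumulate empirical data, for example "Chinese invoice date extraction → TIER4" and "contract semantic comparison → TIER1". As test rounds accumulate, this table becomes the starting reference for distilling new rules.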
110
+ **Minimum test runs**: Run at least a meaningful number of documents (e.g., min(10, total_samples)) at each candidate tier before making a tier decision. Small samples are unreliable — a 3-document test could be misleading.
217
111
 
218
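- **Consistency with document parsing**: this downgrade framework has the same shape as stepwise parser escalation in the `document-parsing` skill: both test, compare, and choose between tiers. The two can reuse the same assessment scripts and decision logic.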
112
+ **Accuracy delta trigger**: If a lower tier's accuracy is significantly below the higher tier (e.g., >5 percentage points), stay at the higher tier for that step. If the delta is within tolerance, use the cheaper tier.
219
113
 
220
- ## Testing Against Ground Truth
114
+ **Per-step independence**: Each workflow step is assessed separately. Record the optimal tier per step in `config.json`. Do not assume the whole workflow must use one tier.
221
115
 
222
- The skill's verification results are the ground truth. The workflow is tested by comparing its results with the skill's results field by field.
116
+ **Re-assessment trigger**: If production quality control shows a step's accuracy degrading (e.g., due to new document formats), re-run the tier assessment for that step.
223
117
 
224
- ### Comparison dimensions
118
+ **Model-task recommendation list**: Maintain a per-project mapping of (task_type → recommended_tier) based on your testing experience. Over time, these lists can be collected across projects to build generalized tier recommendations.
225
119
 
226
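- - **Verdict agreement**: whether the workflow's verdict matches the skill's verdict
226
- - **Field extraction accuracy**: whether the field values extracted by the workflow match those extracted by the skill
227
- - **Confidence calibration**: whether cases the workflow reports with high confidence really are more accurate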
120
+ All numbers here (10 documents, 5 percentage points, etc.) are recommended starting points. The coding agent and developer user should calibrate these — or replace them entirely with a different assessment approach — based on their specific volume, accuracy requirements, and cost constraints. The pattern matters: **test at each tier → compare accuracy → commit when within tolerance → re-assess on degradation**.
229
121
 
230
- ### Accuracy calculation
122
+ This follows the same tier-transition framework as parser escalation in `document-parsing`: a quality/accuracy score drives the decision to stay, escalate, or skip.
231
123
 
232
- ```
233
- workflow accuracy = number of cases matching the skill's verdict / total number of cases
234
- ```
235
-
236
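- Compute overall accuracy and per-class accuracy separately (accuracy for pass, fail, and unverifiable each), so that class imbalance does not lead to a wrong conclusion.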
124
+ ## Testing Against Ground Truth
237
125
 
238
- ## Version Management
126
+ The coding agent's skill-based results are the ground truth. For each document in Samples/:
239
127
 
240
- Workflow iterations are identified by the file version number:
128
+ 1. Run the workflow.
129
+ 2. Compare the workflow's result against the skill-based result.
130
+ 3. Log discrepancies: which step failed, what was expected vs actual.
131
+ 4. Compute accuracy: `(matching results) / (total documents)`.
132
+ 5. If accuracy < WORKFLOW_ACCURACY, diagnose and fix. Use `evolution-loop` methodology.
241
133
 
242
- - `workflow_v1.py` → initial distilled version
243
- - `workflow_v2.py` → version after prompt optimization
244
- - `workflow_v3.py` → version after changing the model tier
134
+ ## Versioning
245
135
 
246
- Do not overwrite old version files. Keep the full version history so you can roll back and compare.
136
+ Each iteration of a workflow is a new version file: `workflow_v1.py`, `workflow_v2.py`, etc. Track which version is active in `config.json`. See `version-control` skill for the full methodology.
247
137
 
248
- ## Cost Tracking
138
+ ## Cost Tracking
249
139
 
250
- Record cost data for every workflow run:
251
-
252
- ```json
253
- {
254
- "rule_id": "R001",
255
- "workflow_version": "v2",
256
- "document_id": "DOC-001",
257
- "llm_calls": 2,
258
- "total_tokens": 1850,
259
- "estimated_cost_usd": 0.003,
260
- "model_used": "TIER3",
261
- "timestamp": "2025-04-01T10:30:00Z"
262
- }
263
- ```
140
+ Track the cost of each workflow run:
141
+ - Number of LLM calls per document.
142
+ - Total tokens consumed per document.
143
+ - Model tier used per call.
264
144
 
265
- Once aggregated, this data is used to assess the average verification cost per document and to guide model tier optimization.
145
+ This data helps the developer user understand the production cost and informs further optimization.
266
146
 
267
- ## SiliconFlow API Configuration
147
+ ## Worker LLM API
268
148
 
269
- When a workflow calls the worker LLM, it connects to SiliconFlow or another compatible API service through the `API_BASE_URL` and `API_KEY` configured in `.env`.
149
+ Worker LLMs are accessed via SiliconFlow API. Connection details are in `.env`:
150
+ - `SILICONFLOW_API_KEY` for authentication
151
+ - `SILICONFLOW_BASE_URL` for the API endpoint
152
+ - Model names in `TIER1` through `TIER4`
270
153
 
271
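- When making calls:
272
- - Use the standard OpenAI-compatible interface format
273
- - Set reasonable timeouts and a retry mechanism
274
- - Degrade gracefully on API errors (e.g., switch to a fallback model when a model is unavailable)
275
- - Log the token usage and response time of every call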
154
+ See `references/worker-llm-catalog.md` for current model capabilities and context window sizes.
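
The per-step tier selection described in the added text (start at TIER1, downgrade step by step, stop once accuracy falls below WORKFLOW_ACCURACY or drops by more than the suggested 5 percentage points) might be sketched in Python as follows. This is only an illustration under stated assumptions: `select_tier` and the `evaluate` callback are hypothetical names that do not appear in the package, and comparing each tier with the one directly above it is just one reading of the accuracy delta rule.

```python
from typing import Callable, Sequence

# Tier names as listed in .env, ordered from strongest to cheapest.
TIERS = ("TIER1", "TIER2", "TIER3", "TIER4")


def select_tier(
    evaluate: Callable[[str], float],
    threshold: float,
    max_delta: float = 0.05,
    tiers: Sequence[str] = TIERS,
) -> str:
    """Pick the cheapest acceptable tier for ONE workflow step.

    evaluate(tier) is a hypothetical callback: it runs the step on the
    sample documents with that tier and returns accuracy in [0, 1].
    threshold plays the role of WORKFLOW_ACCURACY; max_delta is the
    suggested 5-percentage-point tolerance (an adjustable starting point).
    """
    previous = evaluate(tiers[0])  # top-down: establish the ceiling first
    if previous < threshold:
        raise ValueError("top tier is below threshold; revisit the skill or prompt")
    chosen = tiers[0]
    for tier in tiers[1:]:
        accuracy = evaluate(tier)
        # Stay at the higher tier if this one misses the threshold or
        # loses more than max_delta against the tier directly above it.
        if accuracy < threshold or previous - accuracy > max_delta:
            break
        chosen, previous = tier, accuracy
    return chosen


# Usage with made-up accuracies for a single step.
measured = {"TIER1": 0.96, "TIER2": 0.94, "TIER3": 0.85, "TIER4": 0.80}
print(select_tier(measured.__getitem__, threshold=0.90))  # prints TIER2
```

Calling the function once per LLM step keeps the assessment independent per step, so the chosen tiers can be recorded separately in `config.json`.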