kc-beta 0.7.5 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (81)
  1. package/README.md +47 -0
  2. package/package.json +3 -2
  3. package/src/agent/context.js +17 -1
  4. package/src/agent/engine.js +467 -100
  5. package/src/agent/llm-client.js +24 -1
  6. package/src/agent/pipelines/_advance-hints.js +92 -0
  7. package/src/agent/pipelines/_milestone-derive.js +325 -20
  8. package/src/agent/pipelines/skill-authoring.js +49 -3
  9. package/src/agent/tools/agent-tool.js +2 -2
  10. package/src/agent/tools/consult-skill.js +15 -0
  11. package/src/agent/tools/dashboard-render.js +48 -1
  12. package/src/agent/tools/document-parse.js +31 -2
  13. package/src/agent/tools/phase-advance.js +17 -13
  14. package/src/agent/tools/release.js +343 -7
  15. package/src/agent/tools/sandbox-exec.js +65 -8
  16. package/src/agent/tools/worker-llm-call.js +95 -15
  17. package/src/agent/workspace.js +25 -4
  18. package/src/cli/components.js +4 -1
  19. package/src/cli/index.js +125 -8
  20. package/src/config.js +19 -2
  21. package/src/marathon/driver.js +217 -0
  22. package/src/marathon/prompts.js +93 -0
  23. package/template/.env.template +17 -1
  24. package/template/AGENT.md +2 -2
  25. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  26. package/template/skills/en/bootstrap-workspace/SKILL.md +27 -0
  27. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  28. package/template/skills/en/confidence-system/SKILL.md +30 -8
  29. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  30. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  31. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  32. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  33. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  34. package/template/skills/en/document-chunking/SKILL.md +99 -15
  35. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  36. package/template/skills/en/quality-control/SKILL.md +23 -0
  37. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  38. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  39. package/template/skills/en/skill-authoring/SKILL.md +85 -2
  40. package/template/skills/en/skill-creator/SKILL.md +25 -3
  41. package/template/skills/en/skill-to-workflow/SKILL.md +73 -1
  42. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  43. package/template/skills/en/tree-processing/SKILL.md +1 -1
  44. package/template/skills/en/version-control/SKILL.md +15 -0
  45. package/template/skills/en/work-decomposition/SKILL.md +52 -32
  46. package/template/skills/phase_skills.yaml +5 -0
  47. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  48. package/template/skills/zh/bootstrap-workspace/SKILL.md +27 -0
  49. package/template/skills/zh/compliance-judgment/SKILL.md +51 -37
  50. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  51. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  52. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  53. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  54. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  55. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  56. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  57. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  58. package/template/skills/zh/document-chunking/SKILL.md +101 -18
  59. package/template/skills/zh/document-parsing/SKILL.md +65 -65
  60. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  61. package/template/skills/zh/entity-extraction/SKILL.md +78 -68
  62. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  63. package/template/skills/zh/quality-control/SKILL.md +23 -0
  64. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  65. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  66. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  67. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  68. package/template/skills/zh/skill-authoring/SKILL.md +136 -58
  69. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  70. package/template/skills/zh/skill-creator/SKILL.md +215 -201
  71. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  72. package/template/skills/zh/skill-to-workflow/SKILL.md +73 -1
  73. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  74. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  75. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  76. package/template/skills/zh/tree-processing/SKILL.md +67 -63
  77. package/template/skills/zh/version-control/SKILL.md +15 -0
  78. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  79. package/template/skills/zh/work-decomposition/SKILL.md +52 -30
  80. package/template/workflows/common/llm_client.py +168 -0
  81. package/template/workflows/common/utils.py +132 -0
@@ -1,12 +1,12 @@
1
1
  # JSON Schemas
2
2
 
3
- This document defines the JSON schemas used by skill-creator.
3
+ 本文档定义 skill-creator 使用的各类 JSON schema。
4
4
 
5
5
  ---
6
6
 
7
7
  ## evals.json
8
8
 
9
- Defines the evals for a skill. Located at `evals/evals.json` within the skill directory.
9
+ 定义某个技能的评估项。位于该技能目录下的 `evals/evals.json`。
10
10
 
11
11
  ```json
12
12
  {
@@ -26,19 +26,19 @@ Defines the evals for a skill. Located at `evals/evals.json` within the skill di
26
26
  }
27
27
  ```
28
28
 
29
- **Fields:**
30
- - `skill_name`: Name matching the skill's frontmatter
31
- - `evals[].id`: Unique integer identifier
32
- - `evals[].prompt`: The task to execute
33
- - `evals[].expected_output`: Human-readable description of success
34
- - `evals[].files`: Optional list of input file paths (relative to skill root)
35
- - `evals[].expectations`: List of verifiable statements
29
+ **字段:**
30
+ - `skill_name`:与技能 frontmatter 中的 name 一致
31
+ - `evals[].id`:唯一的整数标识
32
+ - `evals[].prompt`:要执行的任务
33
+ - `evals[].expected_output`:以人类可读方式描述的成功标准
34
+ - `evals[].files`:可选的输入文件路径列表(相对于技能根目录)
35
+ - `evals[].expectations`:可被核验的断言列表
36
36
 
37
37
  ---
38
38
 
39
39
  ## history.json
40
40
 
41
- Tracks version progression in Improve mode. Located at workspace root.
41
+ 跟踪 Improve 模式下的版本演进。位于工作区根目录。
42
42
 
43
43
  ```json
44
44
  {
@@ -71,21 +71,21 @@ Tracks version progression in Improve mode. Located at workspace root.
71
71
  }
72
72
  ```
73
73
 
74
- **Fields:**
75
- - `started_at`: ISO timestamp of when improvement started
76
- - `skill_name`: Name of the skill being improved
77
- - `current_best`: Version identifier of the best performer
78
- - `iterations[].version`: Version identifier (v0, v1, ...)
79
- - `iterations[].parent`: Parent version this was derived from
80
- - `iterations[].expectation_pass_rate`: Pass rate from grading
81
- - `iterations[].grading_result`: "baseline", "won", "lost", or "tie"
82
- - `iterations[].is_current_best`: Whether this is the current best version
74
+ **字段:**
75
+ - `started_at`:改进流程启动时间的 ISO 时间戳
76
+ - `skill_name`:正在改进的技能名
77
+ - `current_best`:当前表现最好的版本标识
78
+ - `iterations[].version`:版本标识(v0、v1、…)
79
+ - `iterations[].parent`:本版本派生自哪个父版本
80
+ - `iterations[].expectation_pass_rate`:评分得出的通过率
81
+ - `iterations[].grading_result`:"baseline"、"won"、"lost" 或 "tie"
82
+ - `iterations[].is_current_best`:本版本是否是当前最佳版本
83
83
 
84
84
  ---
85
85
 
86
86
  ## grading.json
87
87
 
88
- Output from the grader agent. Located at `<run-dir>/grading.json`.
88
+ 由评分智能体输出。位于 `<run-dir>/grading.json`。
89
89
 
90
90
  ```json
91
91
  {
@@ -149,20 +149,20 @@ Output from the grader agent. Located at `<run-dir>/grading.json`.
149
149
  }
150
150
  ```
151
151
 
152
- **Fields:**
153
- - `expectations[]`: Graded expectations with evidence
154
- - `summary`: Aggregate pass/fail counts
155
- - `execution_metrics`: Tool usage and output size (from executor's metrics.json)
156
- - `timing`: Wall clock timing (from timing.json)
157
- - `claims`: Extracted and verified claims from the output
158
- - `user_notes_summary`: Issues flagged by the executor
159
- - `eval_feedback`: (optional) Improvement suggestions for the evals, only present when the grader identifies issues worth raising
152
+ **字段:**
153
+ - `expectations[]`:评过分的期望,附证据
154
+ - `summary`:通过/失败的汇总计数
155
+ - `execution_metrics`:工具使用与输出体量(来自 executor 的 metrics.json)
156
+ - `timing`:墙钟时间(来自 timing.json)
157
+ - `claims`:从输出中抽取并核实的论断
158
+ - `user_notes_summary`:executor 标记的问题
159
+ - `eval_feedback`:(可选)针对评估项的改进建议,仅当评分智能体发现值得提出的问题时才出现
160
160
 
161
161
  ---
162
162
 
163
163
  ## metrics.json
164
164
 
165
- Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.
165
+ 由执行智能体输出。位于 `<run-dir>/outputs/metrics.json`。
166
166
 
167
167
  ```json
168
168
  {
@@ -183,22 +183,22 @@ Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.
183
183
  }
184
184
  ```
185
185
 
186
- **Fields:**
187
- - `tool_calls`: Count per tool type
188
- - `total_tool_calls`: Sum of all tool calls
189
- - `total_steps`: Number of major execution steps
190
- - `files_created`: List of output files created
191
- - `errors_encountered`: Number of errors during execution
192
- - `output_chars`: Total character count of output files
193
- - `transcript_chars`: Character count of transcript
186
+ **字段:**
187
+ - `tool_calls`:按工具类型计数
188
+ - `total_tool_calls`:所有工具调用之和
189
+ - `total_steps`:主要执行步骤数
190
+ - `files_created`:创建的输出文件列表
191
+ - `errors_encountered`:执行期间的错误数
192
+ - `output_chars`:输出文件的总字符数
193
+ - `transcript_chars`:transcript 的字符数
194
194
 
195
195
  ---
196
196
 
197
197
  ## timing.json
198
198
 
199
- Wall clock timing for a run. Located at `<run-dir>/timing.json`.
199
+ 一次运行的墙钟计时。位于 `<run-dir>/timing.json`。
200
200
 
201
- **How to capture:** When a subagent task completes, the task notification includes `total_tokens` and `duration_ms`. Save these immediately — they are not persisted anywhere else and cannot be recovered after the fact.
201
+ **如何记录:** 当一个子智能体任务结束时,任务通知中会包含 `total_tokens` 和 `duration_ms`。请立即保存——它们不会持久化到其他地方,事后无法恢复。
202
202
 
203
203
  ```json
204
204
  {
@@ -218,7 +218,7 @@ Wall clock timing for a run. Located at `<run-dir>/timing.json`.
218
218
 
219
219
  ## benchmark.json
220
220
 
221
- Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
221
+ Benchmark 模式的输出。位于 `benchmarks/<timestamp>/benchmark.json`。
222
222
 
223
223
  ```json
224
224
  {
@@ -285,30 +285,30 @@ Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
285
285
  }
286
286
  ```
287
287
 
288
- **Fields:**
289
- - `metadata`: Information about the benchmark run
290
- - `skill_name`: Name of the skill
291
- - `timestamp`: When the benchmark was run
292
- - `evals_run`: List of eval names or IDs
293
- - `runs_per_configuration`: Number of runs per config (e.g. 3)
294
- - `runs[]`: Individual run results
295
- - `eval_id`: Numeric eval identifier
296
- - `eval_name`: Human-readable eval name (used as section header in the viewer)
297
- - `configuration`: Must be `"with_skill"` or `"without_skill"` (the viewer uses this exact string for grouping and color coding)
298
- - `run_number`: Integer run number (1, 2, 3...)
299
- - `result`: Nested object with `pass_rate`, `passed`, `total`, `time_seconds`, `tokens`, `errors`
300
- - `run_summary`: Statistical aggregates per configuration
301
- - `with_skill` / `without_skill`: Each contains `pass_rate`, `time_seconds`, `tokens` objects with `mean` and `stddev` fields
302
- - `delta`: Difference strings like `"+0.50"`, `"+13.0"`, `"+1700"`
303
- - `notes`: Freeform observations from the analyzer
304
-
305
- **Important:** The viewer reads these field names exactly. Using `config` instead of `configuration`, or putting `pass_rate` at the top level of a run instead of nested under `result`, will cause the viewer to show empty/zero values. Always reference this schema when generating benchmark.json manually.
288
+ **字段:**
289
+ - `metadata`:本次 benchmark 运行的信息
290
+ - `skill_name`:技能名
291
+ - `timestamp`:benchmark 运行的时间
292
+ - `evals_run`:评估名或 ID 列表
293
+ - `runs_per_configuration`:每种配置下的运行次数(如 3)
294
+ - `runs[]`:单次运行的结果
295
+ - `eval_id`:评估的数字标识
296
+ - `eval_name`:人类可读的评估名(在 viewer 中作为分节标题)
297
+ - `configuration`:必须为 `"with_skill"` 或 `"without_skill"`(viewer 用该字符串做分组和配色)
298
+ - `run_number`:整数运行编号(1、2、3…)
299
+ - `result`:嵌套对象,含 `pass_rate`、`passed`、`total`、`time_seconds`、`tokens`、`errors`
300
+ - `run_summary`:按配置的统计聚合
301
+ - `with_skill` / `without_skill`:各包含 `pass_rate`、`time_seconds`、`tokens`,每项含 `mean` 和 `stddev`
302
+ - `delta`:差值字符串,如 `"+0.50"`、`"+13.0"`、`"+1700"`
303
+ - `notes`:分析智能体的自由格式观察
304
+
305
+ **重要:** viewer 严格按这些字段名读取。把 `configuration` 写成 `config`,或把 `pass_rate` 放在 run 的顶层而非嵌套于 `result` 之下,都会导致 viewer 显示为空或为零值。在手工生成 benchmark.json 时务必参照此 schema。
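The field-name requirements in the note above can be checked mechanically before a hand-written benchmark.json reaches the viewer. The following is a minimal sketch, assuming only the field names listed in this schema. The name `check_run` is hypothetical and not part of skill-creator, and treating every listed `result` key as required is an assumption.

```python
ALLOWED_CONFIGURATIONS = {"with_skill", "without_skill"}
RESULT_KEYS = ("pass_rate", "passed", "total", "time_seconds", "tokens", "errors")


def check_run(run: dict) -> list:
    """Return a list of problems found in one entry of `runs[]` (empty if none)."""
    problems = []
    # The viewer groups on the exact key `configuration`, not `config`.
    if "config" in run and "configuration" not in run:
        problems.append("use `configuration`, not `config`")
    elif run.get("configuration") not in ALLOWED_CONFIGURATIONS:
        problems.append("`configuration` must be 'with_skill' or 'without_skill'")
    # `pass_rate` must be nested under `result`, never at the top level of a run.
    if "pass_rate" in run:
        problems.append("`pass_rate` belongs under `result`, not at the top level")
    result = run.get("result")
    if not isinstance(result, dict):
        problems.append("missing nested `result` object")
    else:
        for key in RESULT_KEYS:
            if key not in result:
                problems.append(f"`result.{key}` is missing")
    return problems
```

Running this over every element of `runs[]` and refusing to write the file when any list is non-empty would catch the two mistakes the note names.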
306
306
 
307
307
  ---
308
308
 
309
309
  ## comparison.json
310
310
 
311
- Output from blind comparator. Located at `<grading-dir>/comparison-N.json`.
311
+ 由盲比较器输出。位于 `<grading-dir>/comparison-N.json`。
312
312
 
313
313
  ```json
314
314
  {
@@ -383,7 +383,7 @@ Output from blind comparator. Located at `<grading-dir>/comparison-N.json`.
383
383
 
384
384
  ## analysis.json
385
385
 
386
- Output from post-hoc analyzer. Located at `<grading-dir>/analysis.json`.
386
+ 由事后分析器输出。位于 `<grading-dir>/analysis.json`。
387
387
 
388
388
  ```json
389
389
  {
@@ -71,7 +71,7 @@ skill 是 ground truth。workflow 是更便宜、更快的近似。你的工作
71
71
  - **tier2-3**:批量抽取 + 简单语义检查
72
72
  - **tier4**(最便宜):正则无法覆盖、量又很大的关键词识别。注意:SiliconFlow 上的 tier4 模型是 Qwen3.5 thinking 模式——如果 `reasoning_content` 把 max_tokens 用光,`content` 可能返回空字符串。在依赖之前先用真实提示词测试。如果出现空响应,要么把 max_tokens 提到 ≥8192,要么缩短提示词,要么回退到 tier1-2。
73
73
 
74
- v0.7.1 两位审计 conductor(DS 和 GLM)默认都走全正则蒸馏,只有当用户显式要求"V2,带 worker LLM"时才加上 LLM 上升路径。如果你的规则目录里有任何一条规则的验证本质上就是语义性的,你应当主动伸手去用 `worker_llm_call`——不要等别人要你才用。
74
+ 一种值得警惕的失败模式:agent 默认全正则蒸馏,只有当用户显式要求"V2,带 worker LLM"时才加上 LLM 上升路径。如果你的规则目录里有任何一条规则的验证本质上就是语义性的,你应当主动伸手去用 `worker_llm_call`——不要等别人要你才用。
75
75
 
76
76
  ## Workflow 结构
77
77
 
@@ -149,6 +149,17 @@ worker LLM 的上下文窗口较小(典型 16K-32K token)。设计提示词
149
149
 
150
150
  这与 `document-parsing` 里 parser 上升的层级转移框架是同一套:由一个质量/精度评分驱动"保留 / 上升 / 跳过"的决定。
151
151
 
152
+ ### 在 tier 槽位内挑具体模型 —— 速查
153
+
154
+ 上面的 tier 框架回答"这一步该用哪个 tier?"。在某个 tier 槽位内仍然要回答"具体用哪个模型?"。下面几条启发式短期内有效(具体型号请从 `auto-model-selection` 刷新,模型代次的更替以月为单位、不是以年为单位):
155
+
156
+ - **Tier 1 / Tier 2 主力 worker**:当代的旗舰 MoE LLM(总量 200-400B、激活 ~20B 专家)是合理的起点基准。Qwen 家族当前的旗舰、DeepSeek 当代的高级模型都是这个形状;任一都行。
157
+ - **Tier 3 / Tier 4 小模型**:30B 以下优先选 Qwen 家族 —— 便宜可靠的选择最多。小尺寸下避开名字里带 `coder` / `code` 的变体(在通用 worker 任务上不可靠)。能选无 thinking 模式的就选无 thinking —— 这些任务不需要反思。
158
+ - **provider 分流**:把 conductor 和 worker 走不同的服务商,可以隔离单一服务商对同一模型的限流暴露(例如 worker 走 DeepSeek、conductor 留在 SiliconFlow)。
159
+ - **VLM / OCR**:字符 / 手写 / 印章 → 专用 OCR 模型(Paddle-OCR、 GLM-OCR、DeepSeek-OCR 或其后继)。复杂图表 / 表格 → 更大的通用 VLM。
160
+
161
+ 具体事实(确切模型名、上下文窗口、定价)用 `auto-model-selection` + Context7 查。上面的启发式过得快,但**形状**(旗舰 MoE 当主力、 30B 以下无 thinking 当便宜底、OCR 专用做字符)稳定得多。
162
+
152
163
  ## 用 Ground Truth 做测试
153
164
 
154
165
  编码 agent 基于 skill 的结果就是 ground truth。对 Samples/ 下每篇文档:
@@ -188,3 +199,64 @@ Worker LLM 通过 SiliconFlow API 访问。连接信息在 `.env` 里:
188
199
  - `TIER1` 到 `TIER4` —— 各层级的模型名称
189
200
 
190
201
  各模型当前的能力与上下文窗口大小,见 `references/worker-llm-catalog.md`。
202
+
203
+ ## 两条访问路径:`worker_llm_call` 工具(优先)vs 直接 HTTP
204
+
205
+ KC 自带一个 `worker_llm_call` 工具。能用就用 —— 引擎能看到每次调用,能统计成本和 token、做限流、并把数据进入审计。它支持批量模式:
206
+
207
+ ```
208
+ worker_llm_call({
209
+ tier: "tier1",
210
+ prompts: ["核查文档 A...", "核查文档 B...", "核查文档 C..."],
211
+ system_prompt: "你是合规助手。返回 JSON {verdict, evidence, confidence}。",
212
+ concurrency: 5 // 1-10,默认 5
213
+ })
214
+ ```
215
+
216
+ 返回 `{n_total, n_succeeded, n_failed, total_tokens_in, total_tokens_out, results: [...]}` 摘要。部分失败不会让整批失败。
217
+
218
+ ### 规范的 `workflows/common/llm_client.py`(作为模板文件随包发布)
219
+
220
+ 对于一个 **独立运行** 的 workflow(没有 KC 会话 —— 比如客户把 release 包部署后跑 `python run.py doc.pdf`),workflow 拿不到 `worker_llm_call`。规范的 HTTP 客户端 shim 作为模板文件随 kc-beta 一起发布;引擎初始化时会自动把它写入工作区的 `workflows/common/llm_client.py`。**不要自己重写**。直接用这个已经放好的文件:
221
+
222
+ ```python
223
+ from workflows.common.llm_client import call
224
+
225
+ result = call(
226
+ tier="tier2",
227
+ prompt=user_prompt,
228
+ system_prompt="你是合规助手。返回 JSON。",
229
+ max_tokens=2048,
230
+ )
231
+ # result = {"response": "...", "model_used": "...", "tier": "tier2",
232
+ # "tokens_in": N, "tokens_out": N}
233
+ ```
234
+
235
+ shim 做的事:
236
+ - 从 `.env` 读 `LLM_API_KEY` + `LLM_BASE_URL` + `TIER1..4`(多 provider 友好 —— SiliconFlow、OpenAI、Anthropic、阿里、火山等都能用)
237
+ - 以 OpenAI 兼容的 chat completions 格式发请求到配置好的 base URL
238
+ - 每次调用往 `output/llm_ledger.jsonl` 写一行,KC 审计即使在你没走 worker_llm_call 时也能还原成本
239
+ - 如果 `LLM_BASE_URL` 缺失,会显式抛错(不会偷偷回退到某个写死的 vendor URL)
240
+
241
+ **不要自己从零写 llm_client.py**。一种值得警惕的失败模式:agent 反复自己造轮子 —— 拼出来的版本要么模型 ID 过期、要么写死某个 vendor URL、要么不写 ledger,且对引擎不可见。优先用规范化版本;如果因为某种原因没有,从 kc-beta 安装目录的 `template/workflows/common/llm_client.py` 复制过来(引擎也会在 init 时自动写入 —— 检查 events.jsonl 里的 `workflows_common_populated` 事件)。
242
+
243
+ ### 更糟的反模式:规范化客户端已经存在,agent 同时另写一个并行的
244
+
245
+ 跨多次运行反复出现的一种失败模式:规范化的 `workflows/common/llm_client.py` 已经在工作区里(引擎 init 时已经写入了),**与此同时** agent 又自己写了 `workflows/llm_client.py` 或 `verify_engine_v2.py`,里面用 `requests.post(...)` 发 HTTP。之后所有真正的 LLM 工作都走自己写的那份。两个文件并排躺在工作区里。引擎的成本追踪什么都看不到。
246
+
247
+ 这样会出三件事:(1) provider 路由跑偏。手写的客户端通常读工作区 `.env` 的 `LLM_BASE_URL` —— 那是**conductor**的端点。KC 的 worker 路由(通过 `worker_llm_call` 和引擎的 worker_* 配置)被完全绕开。如果运营方把 worker 配到了另外一个服务商(比如 DeepSeek 当 worker、SiliconFlow 当 conductor),手写客户端就会拿着 worker 的模型名打到 conductor 的服务商上 —— 出 400,或者更糟地,悄悄拿到错模型的结果。(2) 成本 / 审计可见性丢失。引擎看不到这些调用;`output/llm_ledger.jsonl`(规范化客户端写的那份 ledger)也没有记录(手写客户端不写这个)。一场跑下来看起来什么 LLM 工作都没做,但实际账单已经产生。(3) 限流 / 重试 / 超时行为发散。规范化客户端 + `worker_llm_call` 继承引擎层面的健壮性(AbortSignal.timeout、429/5xx 上的 withRetry 等)。手写的 `requests.post` 一概没有 —— 会卡住、会抛各种自定义错误、或者悄悄毁掉这一轮。
248
+
249
+ **经验法则**:核查规则需要 LLM 判定时,正确选择只有两个 —— `worker_llm_call`(在 KC session 里运行)或 `from workflows.common.llm_client import call`(从 release bundle 脱离 KC 独立运行时)。如果你发现自己在为一个 LLM 调用敲 `import requests` 或者 `urllib.request.urlopen`,停下。这条代码路径会在审计里被点名是反复出现的"该用没用",然后被重写 —— 省一趟来回,第一次就用对工具。
250
+
251
+ ## sandbox_exec 超时设置(已知耗时长的命令)
252
+
253
+ `sandbox_exec` 默认超时是 120 秒。对于你预期会跑得更久的命令 —— LLM 批处理、大型回归测试、文档解析 —— 显式传 `timeout_ms`(最大 600000ms = 10 分钟)。不要靠把任务切成不必要的小块来绕开默认值;那只会浪费回合数并模糊意图。
254
+
255
+ ```
256
+ sandbox_exec({
257
+ command: "python scripts/v2_full_test.py",
258
+ timeout_ms: 480000 // 14 条规则 × 6 篇文档走 worker LLM,预留 8 分钟
259
+ })
260
+ ```
261
+
262
+ 如果已经顶到 10 分钟上限还在超时,把工作拆成多次调用,或者交给子代理(子代理的超时和父进程相互独立)。
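The empty-`content` caveat for tier4 thinking models, together with the return shape of `call(...)` shown in this hunk, suggests a small guard around low-tier calls. The following is a minimal sketch and not part of kc-beta: `call_with_fallback` and its `call_fn` parameter are hypothetical names, and `call_fn` is assumed to accept the same keyword arguments and return the same dict as `workflows.common.llm_client.call`.

```python
from typing import Callable, Optional


def call_with_fallback(
    call_fn: Callable[..., dict],
    prompt: str,
    system_prompt: str,
    tier: str = "tier4",
    fallback_tier: str = "tier2",
    max_tokens: int = 2048,
    retry_max_tokens: int = 8192,
) -> Optional[dict]:
    """Try the cheap tier, then a larger token budget, then a higher tier."""
    attempts = [
        (tier, max_tokens),
        (tier, retry_max_tokens),           # reasoning may have used up the budget
        (fallback_tier, retry_max_tokens),  # last resort: a stronger tier
    ]
    for attempt_tier, budget in attempts:
        result = call_fn(
            tier=attempt_tier,
            prompt=prompt,
            system_prompt=system_prompt,
            max_tokens=budget,
        )
        # An empty or whitespace-only response counts as a failed attempt.
        if (result.get("response") or "").strip():
            return result
    return None  # every attempt came back empty; the caller decides what to do
```

Passing the client in as `call_fn` keeps the guard testable with a stub and leaves the canonical client as the only code that talks to the network.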
@@ -1,36 +1,36 @@
1
- # Worker LLM Catalog
1
+ # Worker LLM 目录
2
2
 
3
- Models available via SiliconFlow API for worker LLM tasks. Update this catalog as models change.
3
+ 通过 SiliconFlow API 可调用的 worker LLM 模型。模型有更新时同步维护此目录。
4
4
 
5
- ## Text Models
5
+ ## 文本模型
6
6
 
7
- | Tier | Model | Context Window | Strengths | Notes |
7
+ | 等级 | 模型 | 上下文窗口 | 优势 | 备注 |
8
8
  |------|-------|---------------|-----------|-------|
9
- | TIER1 | Pro/zai-org/GLM-5 | 128K | Strong reasoning, Chinese language | Top tier for complex judgment |
10
- | TIER1 | Pro/moonshotai/Kimi-K2.5 | 128K | Long context, strong extraction | Good for full-document processing |
11
- | TIER2 | Pro/deepseek-ai/DeepSeek-V3.2 | 64K | Balanced capability/cost | Good general purpose |
12
- | TIER2 | Pro/MiniMaxAI/MiniMax-M2.5 | 64K | Strong Chinese, fast | Good for Chinese documents |
13
- | TIER2 | Qwen/Qwen3.5-397B-A17B | 32K | Large MoE, strong reasoning | Cost-effective for complex tasks |
14
- | TIER3 | Qwen/Qwen3.5-122B-A10B | 32K | Good accuracy, lower cost | Sweet spot for many tasks |
15
- | TIER4 | Qwen/Qwen3.5-35B-A3B | 16K | Fast, cheap | Best for simple extraction |
9
+ | TIER1 | Pro/zai-org/GLM-5 | 128K | 推理能力强、中文好 | 用于复杂判定的顶级选项 |
10
+ | TIER1 | Pro/moonshotai/Kimi-K2.5 | 128K | 长上下文、抽取能力强 | 适合整篇文档处理 |
11
+ | TIER2 | Pro/deepseek-ai/DeepSeek-V3.2 | 64K | 性价比均衡 | 通用场景表现良好 |
12
+ | TIER2 | Pro/MiniMaxAI/MiniMax-M2.5 | 64K | 中文强、速度快 | 适合中文文档 |
13
+ | TIER2 | Qwen/Qwen3.5-397B-A17B | 32K | 大型 MoE,推理力强 | 复杂任务的高性价比选项 |
14
+ | TIER3 | Qwen/Qwen3.5-122B-A10B | 32K | 准确率良好、成本较低 | 多数任务的甜点位 |
15
+ | TIER4 | Qwen/Qwen3.5-35B-A3B | 16K | 快、便宜 | 简单抽取首选 |
16
16
 
17
- ## Vision/OCR Models
17
+ ## 视觉 / OCR 模型
18
18
 
19
- | Tier | Model | Strengths | Notes |
19
+ | 等级 | 模型 | 优势 | 备注 |
20
20
  |------|-------|-----------|-------|
21
- | OCR_TIER1 | zai-org/GLM-4.6V | Best OCR accuracy | Use for complex tables/charts |
22
- | OCR_TIER2 | Qwen/Qwen3.5-397B-A17B | Good general vision | Multimodal version |
23
- | OCR_TIER3 | PaddlePaddle/PaddleOCR-VL-1.5 | Fast, lightweight OCR | Best for standard text |
21
+ | OCR_TIER1 | zai-org/GLM-4.6V | OCR 准确率最高 | 用于复杂表格/图表 |
22
+ | OCR_TIER2 | Qwen/Qwen3.5-397B-A17B | 通用视觉好 | 多模态版本 |
23
+ | OCR_TIER3 | PaddlePaddle/PaddleOCR-VL-1.5 | 快、轻量 OCR | 标准文本首选 |
24
24
 
25
- ## Selection Guidelines
25
+ ## 选型要点
26
26
 
27
- - Start with the highest tier that fits your context window needs.
28
- - For extraction of simple entities (dates, amounts, names): TIER3-4 often sufficient.
29
- - For semantic judgment (adequacy, compliance): TIER1-2 usually needed.
30
- - For Chinese financial documents: prefer GLM and Qwen models over DeepSeek for domain terminology.
31
- - Context window constraint: if the section to process exceeds the model's window, either narrow the context further (tree processing) or use a model with a larger window.
27
+ - 在能满足上下文窗口需求的前提下,优先选择最高等级的模型。
28
+ - 抽取简单实体(日期、金额、姓名):TIER3-4 通常够用。
29
+ - 语义判定(充分性、合规性):通常需要 TIER1-2。
30
+ - 中文金融文档:优先选择 GLM 和 Qwen 系列,而非 DeepSeek,以更好处理行业术语。
31
+ - 上下文窗口约束:若待处理段落超出模型窗口,要么进一步收窄上下文(采用树状处理),要么换上下文更大的模型。
32
32
 
33
- ## API Configuration
33
+ ## API 配置
34
34
 
35
35
  ```python
36
36
  import openai
@@ -47,4 +47,4 @@ response = client.chat.completions.create(
47
47
  )
48
48
  ```
49
49
 
50
- This catalog should be maintained by the coding agent. Add new models as they become available, remove deprecated models, and update capability assessments based on testing experience.
50
+ 本目录由编程智能体负责维护。有新模型时及时加入,模型停服时移除,并基于测试经验更新能力评估。
@@ -225,7 +225,7 @@ KC 偏好的两种模式:
225
225
 
226
226
  ## 与其他技能的衔接
227
227
 
228
- 任务分解在 KC Reborn 生命周期中处于规则提取和技能编写之间。
228
+ 任务分解在 KC 生命周期中处于规则提取和技能编写之间。
229
229
 
230
230
  **输入**:来自 `rule-extraction` 的规则目录。每条规则是一个原子级、可测试的核查要求。如果规则尚未达到原子级别,先退回给规则提取环节做进一步分解,再进入任务分解。
231
231
 
@@ -1,81 +1,81 @@
1
- # Decision Matrix for Method Selection
1
+ # 方法选择决策矩阵
2
2
 
3
- This reference provides the detailed decision matrix for assigning methods to sub-tasks during task decomposition. Read `task-decomposition` SKILL.md first for the philosophy; this document is the operational reference.
3
+ 本文档是任务分解阶段为各子任务分配方法时使用的详细决策矩阵。先阅读 `task-decomposition` SKILL.md 了解方法论;本文档是操作层面的参考。
4
4
 
5
- ## The Four Dimensions
5
+ ## 四个维度
6
6
 
7
- | Dimension | Definition | 1 (Low) | 3 (Medium) | 5 (High) |
7
+ | 维度 | 定义 | 1(低) | 3(中) | 5(高) |
8
8
  |---|---|---|---|---|
9
- | **Certainty** | Predictability of input format and location | Free-form prose, no fixed structure | Semi-structured with known sections but variable formatting | Fixed template, exact field positions |
10
- | **Scale** | Number of items to process per document | 1-5 items | 10-100 items | 1,000+ items |
11
- | **Semantic Depth** | Language understanding required | None — pure pattern or numeric | Moderate — entity recognition, simple context | Deep — judgment, adequacy assessment, intent interpretation |
12
- | **Cost Sensitivity** | Budget constraint per document | Unlimited (one-off audit) | Moderate (monthly batch of hundreds) | Tight (daily batch of thousands) |
9
+ | **确定性** | 输入格式与位置的可预测程度 | 自由散文,无固定结构 | 半结构化,章节已知但格式多变 | 固定模板,字段位置精确 |
10
+ | **规模** | 每份文档需处理的条目数 | 1-5 | 10-100 | 1,000+ |
11
+ | **语义深度** | 所需的语言理解程度 | 无——纯模式或数值 | 中等——实体识别、简单上下文 | 深——判断、充分性评估、意图解释 |
12
+ | **成本敏感度** | 每份文档的预算约束 | 无限(一次性审计) | 中等(每月数百件批处理) | 紧(每日数千件批处理) |
13
13
 
14
- ## Method Assignment Rules
14
+ ## 方法分配规则
15
15
 
16
- Use the highest-priority method whose requirements are met. Priority order: Rule/Regex > Code > LLM > Manual.
16
+ 挑选满足条件中优先级最高的方法。优先级顺序:规则/正则 > 代码 > LLM > 人工。
17
17
 
18
- | Certainty | Scale | Semantic Depth | Cost Sensitivity | Assigned Method | Rationale |
18
+ | 确定性 | 规模 | 语义深度 | 成本敏感度 | 分配方法 | 原因 |
19
19
  |---|---|---|---|---|---|
20
- | High (4-5) | Any | Low (1-2) | Any | **Rule / Regex** | Predictable input + no language understanding = deterministic pattern matching |
21
- | High (4-5) | Any | Low (1-2) | Any | **Code / Python** | Calculations, comparisons, transformations on structured data |
22
- | Medium (3) | High (4-5) | Low (1-2) | High (4-5) | **Code + Regex** | Volume demands speed; invest in parsing code to avoid per-item LLM cost |
23
- | Medium (3) | Low (1-2) | Medium (3) | Low (1-2) | **LLM** | Moderate understanding needed, low volume makes LLM cost acceptable |
24
- | Low (1-2) | Any | High (4-5) | Any | **LLM** | Deep semantic understanding has no cheaper alternative |
25
- | Low (1-2) | High (4-5) | High (4-5) | High (4-5) | **LLM (low tier) + sampling** | Volume + semantics + budget = use cheapest LLM, sample-verify with higher tier |
26
- | Any | Any | Any | — | **Manual** | Last resort when automated methods fail accuracy threshold |
20
+ | 高 (4-5) | 任意 | 低 (1-2) | 任意 | **规则 / 正则** | 输入可预测 + 不需语言理解 = 确定性模式匹配 |
21
+ | 高 (4-5) | 任意 | 低 (1-2) | 任意 | **代码 / Python** | 在结构化数据上做计算、比较、转换 |
22
+ | 中 (3) | 高 (4-5) | 低 (1-2) | 高 (4-5) | **代码 + 正则** | 高吞吐要求速度;投入解析代码以避免逐条调 LLM 的成本 |
23
+ | 中 (3) | 低 (1-2) | 中 (3) | 低 (1-2) | **LLM** | 需要中等程度的理解,低吞吐使 LLM 成本可接受 |
24
+ | 低 (1-2) | 任意 | 高 (4-5) | 任意 | **LLM** | 深层语义理解没有更便宜的替代方案 |
25
+ | 低 (1-2) | 高 (4-5) | 高 (4-5) | 高 (4-5) | **低层 LLM + 抽样** | 吞吐 + 语义 + 预算 = 用最便宜的 LLM 跑、用更高层模型抽样校验 |
26
+ | 任意 | 任意 | 任意 | — | **人工** | 自动方法均无法达标时的兜底 |
27
27
 
28
- The table covers common patterns, not every combination. When a sub-task falls between categories, test both candidate methods on a sample and measure accuracy and cost. Let data decide.
28
+ 该表覆盖常见情况,并非穷举。当子任务介于两类之间,请在样本上同时测试候选方法,量化准确率与成本,让数据来定。
29
29
 
30
- ## Worked Example: Cross-Field Validation
30
+ ## 实例:跨字段校验
31
31
 
32
- **Rule**: "The loan amount must not exceed 70% of the appraised collateral value."
32
+ **规则**:"贷款金额不得超过资产评估值的 70%。"
33
33
 
34
- Decomposition into sub-tasks with method assignments:
34
+ 分解为子任务并分配方法:
35
35
 
36
- | # | Sub-task | Input | Output | Method | Rationale |
36
+ | # | 子任务 | 输入 | 输出 | 方法 | 原因 |
37
37
  |---|---|---|---|---|---|
38
- | 1 | Locate loan amount field | Full document text | Page/section reference | LLM (Tier 3) | Field position varies across document types |
39
- | 2 | Extract loan amount | Located section text | Numeric value (float) | Regex | Amount follows pattern: ¥/$/digits with commas |
40
- | 3 | Locate collateral section | Full document text | Page/section reference | LLM (Tier 3) | Section name varies: "Collateral", "Security", "Pledged Assets" |
41
- | 4 | Extract appraised value | Located section text | Numeric value (float) | Regex + Code | Regex extracts; code handles unit conversion (万/亿) |
42
- | 5 | Calculate threshold | Loan amount, collateral value | 70% threshold value | Code | Pure arithmetic: `collateral * 0.70` |
43
- | 6 | Compare | Loan amount, threshold | Pass/Fail | Code | Simple comparison: `loan_amount <= threshold` |
44
- | 7 | Generate comment | All extracted values | Comment string | Code (template) | Template: "Loan amount {X} is {above/within} 70% of collateral value {Y} (threshold: {Z})" |
38
+ | 1 | 定位贷款金额字段 | 全文 | 页/节定位 | LLM (Tier 3) | 字段位置因文档类型而异 |
39
+ | 2 | 抽取贷款金额 | 已定位段落文本 | 数值 (float) | 正则 | 金额遵循模式:¥/$/带逗号数字 |
40
+ | 3 | 定位抵押物章节 | 全文 | 页/节定位 | LLM (Tier 3) | 章节名称多变:"Collateral"、"Security"、"Pledged Assets" |
41
+ | 4 | 抽取评估价值 | 已定位段落文本 | 数值 (float) | 正则 + 代码 | 正则负责抽取;代码处理单位换算(万/亿) |
42
+ | 5 | 计算阈值 | 贷款金额、抵押价值 | 70% 阈值 | 代码 | 纯算术:`collateral * 0.70` |
43
+ | 6 | 比较 | 贷款金额、阈值 | 通过/不通过 | 代码 | 简单比较:`loan_amount <= threshold` |
44
+ | 7 | 生成批注 | 所有抽取值 | 批注字符串 | 代码(模板) | 模板:"Loan amount {X} is {above/within} 70% of collateral value {Y} (threshold: {Z})" |
45
45
 
46
- LLM calls: 2 (locate steps only). Everything else is regex or code. Total LLM cost per document: ~0.002 USD at Tier 3 pricing.
46
+ LLM 调用次数:2 次(仅定位环节)。其余全部用正则或代码。每份文档的 LLM 总成本:约 0.002 USD(按 Tier 3 价格)。
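Sub-tasks 4 to 7 of the table above are pure code. The following is a minimal sketch under stated assumptions: the function names, the regular expression, and the sample strings are illustrative and not taken from kc-beta. Only the 70% threshold, the 万/亿 unit handling, the comparison `loan_amount <= threshold`, and the comment template come from the example. The sketch uses `Decimal` rather than the `float` named in the table, so that a loan sitting exactly on the threshold compares reliably.

```python
import re
from decimal import Decimal

# Multipliers for the Chinese magnitude units mentioned in sub-task 4.
UNIT_MULTIPLIERS = {"": Decimal(1), "万": Decimal(10_000), "亿": Decimal(100_000_000)}
AMOUNT_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(万|亿)?")


def parse_amount(text: str) -> Decimal:
    """Extract the first amount in `text` and apply any 万/亿 unit."""
    match = AMOUNT_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no amount found in {text!r}")
    digits = match.group(1).replace(",", "")
    unit = match.group(2) or ""
    return Decimal(digits) * UNIT_MULTIPLIERS[unit]


def check_loan_to_value(loan_text: str, collateral_text: str,
                        ratio: Decimal = Decimal("0.70")) -> dict:
    """Sub-tasks 5-7: compute the threshold, compare, and fill the comment template."""
    loan_amount = parse_amount(loan_text)
    collateral = parse_amount(collateral_text)
    threshold = collateral * ratio
    passed = loan_amount <= threshold
    position = "within" if passed else "above"
    comment = (
        f"Loan amount {loan_amount} is {position} {int(ratio * 100)}% of "
        f"collateral value {collateral} (threshold: {threshold})"
    )
    return {"passed": passed, "threshold": threshold, "comment": comment}
```

The two locate steps are deliberately left out: in the table they are the only LLM calls, and their output (the located section text) is what `loan_text` and `collateral_text` stand for here.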
47
47
 
48
- ## Worked Example: Large-Scale Filtering
48
+ ## 实例:大规模筛选
49
49
 
50
- **Task**: Match 31,800 invoices against 15,940 contracts to find which invoices belong to which contracts.
50
+ **任务**:把 31,800 张发票与 15,940 份合同进行匹配,找出每张发票归属哪份合同。
51
51
 
52
- Naive approach: 507M pairwise LLM comparisons. Estimated cost: $50,000+. Time: weeks.
52
+ 朴素做法:5.07 亿次成对 LLM 比较。预估成本:5 万美元以上。耗时:以周计。
53
53
 
54
- Layered decomposition:
54
+ 分层分解:
55
55
 
56
- | Layer | Method | Input Size | Output Size | Reduction | Cost |
56
+ | 层 | 方法 | 输入规模 | 输出规模 | 削减比例 | 成本 |
57
57
  |---|---|---|---|---|---|
58
- | 1. Exact match on supplier name + contract number | Rule/Regex | 507M pairs | 25,200 matches | 99.5% eliminated | ~$0 |
59
- | 2. Fuzzy match on amount range (±5%) + date overlap | Code | Remaining unmatched pairs | 12,400 candidates | 97.6% of remainder eliminated | ~$0 |
60
- | 3. Semantic comparison of line-item descriptions | LLM (Tier 3) | 12,400 candidates | 7,652 confirmed | Final precision filter | ~$25 |
61
- | 4. Manual review of low-confidence matches | Manual | ~200 uncertain | ~200 resolved | Edge cases | ~$100 (labor) |
58
+ | 1. 按供应商名 + 合同号精确匹配 | 规则/正则 | 5.07 亿组 | 25,200 组匹配 | 削减 99.5% | 约 $0 |
59
+ | 2. 按金额范围(±5%)+ 日期重叠做模糊匹配 | 代码 | 剩余未匹配组 | 12,400 组候选 | 在剩余中再削减 97.6% | 约 $0 |
60
+ | 3. 对明细项描述做语义对比 | LLM (Tier 3) | 12,400 组候选 | 7,652 组确认匹配 | 最终精度过滤 | 约 $25 |
61
+ | 4. 低置信度匹配的人工复核 | 人工 | 约 200 条不确定 | 约 200 条裁决 | 边界情况 | 约 $100(人力) |
62
62
 
63
- Total cost: ~$125. Time: hours. Same accuracy as the naive approach.
63
+ 总成本:约 $125。耗时:以小时计。准确率与朴素做法相当。
64
64
 
65
- The key insight: each layer's method is chosen because it is the cheapest method that can reliably make the distinctions required at that stage.
65
+ 关键洞察:每一层使用的方法,都是该阶段能可靠完成区分任务的最便宜方法。
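The "507M" figure is the product 31,800 × 15,940 = 506,892,000, and the first two layers can be expressed as ordinary code. The following is a minimal sketch, assuming hypothetical record fields (`id`, `supplier`, `contract_no`, `amount`). The date-overlap test from layer 2 is omitted for brevity, and `layered_match` is an illustrative name rather than anything shipped with kc-beta.

```python
INVOICE_COUNT, CONTRACT_COUNT = 31_800, 15_940
NAIVE_PAIRS = INVOICE_COUNT * CONTRACT_COUNT  # 506,892,000 pairwise comparisons


def layered_match(invoices, contracts, tolerance=0.05):
    """Layer 1: exact key match. Layer 2: amount within +/- tolerance.

    Returns (matched, candidates); only `candidates` would go on to the LLM layer.
    """
    by_key = {(c["supplier"], c["contract_no"]): c for c in contracts}
    matched, candidates = [], []
    for inv in invoices:
        exact = by_key.get((inv["supplier"], inv.get("contract_no")))
        if exact is not None:
            matched.append((inv["id"], exact["id"]))
            continue  # resolved at zero LLM cost
        for c in contracts:
            if abs(inv["amount"] - c["amount"]) <= tolerance * c["amount"]:
                candidates.append((inv["id"], c["id"]))
    return matched, candidates
```

Invoices that survive neither layer simply drop out, which is the reduction the table attributes to the cheap layers.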
66
66
 
67
- ## Cost Estimation Template
67
+ ## 成本估算模板
68
68
 
69
- Use this template during decomposition planning to estimate per-document cost.
69
+ 在分解规划阶段用此模板估算每份文档的成本。
70
70
 
71
- | Sub-task | Method | Est. Cost/Call | Calls/Document | Subtotal |
71
+ | 子任务 | 方法 | 单次成本估算 | 每文档调用次数 | 小计 |
72
72
  |---|---|---|---|---|
73
- | Locate section | LLM Tier 3 | $0.001 | 2 | $0.002 |
74
- | Extract fields | Regex | $0.000 | 5 | $0.000 |
75
- | Normalize values | Python | $0.000 | 5 | $0.000 |
76
- | Cross-field comparison | Python | $0.000 | 1 | $0.000 |
77
- | Semantic judgment | LLM Tier 2 | $0.003 | 1 | $0.003 |
78
- | Comment generation | Template | $0.000 | 1 | $0.000 |
79
- | **Total per document** | | | | **$0.005** |
80
-
81
- Multiply by expected document volume to get batch cost. Compare against the developer user's budget. If total exceeds budget, optimize the most expensive sub-tasks first — usually the LLM calls with the highest per-call cost or the highest call count.
73
+ | 定位章节 | LLM Tier 3 | $0.001 | 2 | $0.002 |
74
+ | 抽取字段 | 正则 | $0.000 | 5 | $0.000 |
75
+ | 规范化数值 | Python | $0.000 | 5 | $0.000 |
76
+ | 跨字段比较 | Python | $0.000 | 1 | $0.000 |
77
+ | 语义判定 | LLM Tier 2 | $0.003 | 1 | $0.003 |
78
+ | 生成批注 | 模板 | $0.000 | 1 | $0.000 |
79
+ | **每文档合计** | | | | **$0.005** |
80
+
81
+ 乘以预期文档量得到批次成本。与开发者用户的预算对比。若总成本超预算,优先优化最贵的子任务——通常是每次调用单价最高、或调用次数最多的 LLM 环节。
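The template above reduces to a sum of cost-per-call times calls-per-document, multiplied by the document volume. The following is a minimal sketch using the figures from the table. The function names are hypothetical, and the 10,000-document batch in the usage note is an illustrative volume, not a number from the source.

```python
from decimal import Decimal

# (sub-task, method, estimated cost per call in USD, calls per document)
SUBTASKS = [
    ("Locate section", "LLM Tier 3", Decimal("0.001"), 2),
    ("Extract fields", "Regex", Decimal("0.000"), 5),
    ("Normalize values", "Python", Decimal("0.000"), 5),
    ("Cross-field comparison", "Python", Decimal("0.000"), 1),
    ("Semantic judgment", "LLM Tier 2", Decimal("0.003"), 1),
    ("Comment generation", "Template", Decimal("0.000"), 1),
]


def cost_per_document(subtasks) -> Decimal:
    """Sum the per-sub-task subtotals (cost per call times calls per document)."""
    return sum((cost * calls for _, _, cost, calls in subtasks), Decimal("0"))


def batch_cost(subtasks, documents: int) -> Decimal:
    """Scale the per-document cost to an expected document volume."""
    return cost_per_document(subtasks) * documents


def most_expensive(subtasks) -> str:
    """Name the sub-task with the largest subtotal: the first one to optimize."""
    return max(subtasks, key=lambda row: row[2] * row[3])[0]
```

With these figures `cost_per_document(SUBTASKS)` is 0.005 USD, so a hypothetical batch of 10,000 documents costs 50 USD, and `most_expensive(SUBTASKS)` points at the Tier 2 semantic judgment.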