kc-beta 0.7.1 → 0.7.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +15 -8
- package/package.json +1 -1
- package/src/agent/engine.js +32 -2
- package/src/agent/pipelines/_milestone-derive.js +65 -42
- package/src/agent/pipelines/finalization.js +2 -6
- package/src/agent/pipelines/initializer.js +13 -0
- package/src/agent/tools/copy-to-workspace.js +17 -12
- package/src/agent/tools/release.js +151 -1
- package/src/agent/tools/sandbox-exec.js +4 -1
- package/src/agent/tools/task-board.js +194 -0
- package/src/agent/tools/workspace-file.js +58 -44
- package/src/config.js +6 -4
- package/src/util/kc-version.js +27 -0
- package/template/CLAUDE.md +13 -0
- package/template/skills/en/meta-meta/rule-extraction/SKILL.md +77 -0
- package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +26 -0
- package/template/skills/en/meta-meta/work-decomposition/SKILL.md +76 -9
- package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +65 -0
- package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +26 -0
- package/template/skills/zh/meta-meta/work-decomposition/SKILL.md +74 -9
@@ -132,6 +132,53 @@ existing catalog. Therefore, when composing the brief:

catalog.json.** rule_catalog uses workspace file locking (B9);
sandbox_exec bypasses it and races with other writers.

+## How to read rule files (whole-file reads by default)
+
+Regulation files are the authoritative basis for the audit. Every `source_ref` you record for a rule must be verifiable against the original text. For the vast majority of rule files (single file < 50 KB / < ~100 pages), **read the whole file in one call with `workspace_file` (operation=read)**:
+
+```js
+workspace_file({ operation: "read", scope: "project", path: "Rules/01_某某办法.md" })
+```
+
+`workspace_file.read` returns up to 50,000 characters per call, enough to cover almost any single regulation file. This is the default behavior: **before extracting rules, read every regulation file end to end.**
+
+### Tool choice — `workspace_file` or `sandbox_exec`
+
+| Tool | Per-call limit | Suited for |
+|---|---:|---|
+| `workspace_file` (read) | 50,000 chars | **whole-file reads of regulation/rule files** |
+| `sandbox_exec` (cat/head) | 10,000 chars | short commands; not whole-file reads |
+
+`sandbox_exec` is designed for running shell commands; its 10K limit is too small for most regulations. `cat rules/01_*.md` returns only the first ~10 KB, with the rest cut off as `\n[truncated]`. Sliding a `head -N` / `tail -M` window across the file loses line-number positions and wastes interaction turns. **When you hit truncation, don't fight the limit — switch tools.**
+
+### The regulation/sample asymmetry — regulations whole, samples sampled
+
+Regulations usually number only 1–10, carry high authority, and need to be read just once. Read each one in full; it is the foundation for all subsequent rule extraction and citations.
+
+Sample documents may number 30 or even 1000+, are highly heterogeneous, and are read repeatedly during testing. **Do not try to read every sample end to end** — use rule-applicability filtering and sampled subsets to focus attention.
+
+### Exception — a single regulation over 200K characters
+
+Rare in practice. The largest regulation in test_data_4 is 42 KB; typical regulations such as the banking sector's 资管新规 and 信披办法 also stay under 50 KB. But if you do hit an oversized regulation, reading it whole would squeeze the context window (heuristic: a single file over ~200,000 characters, or over ~25% of your context budget). In that case, use your judgment:
+
+- Read it chapter by chapter (`第X章`), via `document_parse` or paginated `workspace_file` calls
+- Or build an index file in the workspace recording each chapter's offset, and read on demand during rule extraction (a sketch follows below this hunk)
+
+The 50 KB cap is already generous; the exception above will almost never trigger. **The default is a whole-file read; deviate only when the file is genuinely too large.**

## Extraction Strategies

### Strategy 1: Structured Input (Developer User Provides Rules)
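Where the bullet above mentions an offset index, here is a minimal sketch of such a helper, assuming the agent runs it via `sandbox_exec`; the script name, output path, and heading regex are illustrative assumptions, not part of the package:

```python
# build_chapter_index.py: hypothetical helper that indexes 第X章 headings by character offset
import json
import re
from pathlib import Path

def build_index(reg_path: str, out_path: str = "rules/chapter_index.json") -> None:
    text = Path(reg_path).read_text(encoding="utf-8")
    # Record each chapter heading (e.g. 第一章 / 第12章 at line start) with its char offset
    chapters = [
        {"heading": m.group(0).strip(), "offset": m.start()}
        for m in re.finditer(r"^第[一二三四五六七八九十百0-9]+章.*$", text, re.MULTILINE)
    ]
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text(
        json.dumps({"file": reg_path, "chapters": chapters}, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )

if __name__ == "__main__":
    build_index("Rules/01_某某办法.md")
```

With the index on disk, a later extraction turn can read just the chapter span it needs instead of the whole file.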
@@ -222,6 +269,24 @@ Regulations are often ambiguous. When you encounter ambiguity:

Do not skip ambiguous rules. They are often the most important ones.

+## Sanity-check applicability against the sample corpus
+
+After extracting your rule catalog and before authoring skills, do this 5-minute check: project each rule's applicability filter against the sample corpus.
+
+The check:
+1. Walk `samples/`, classifying each sample by product type / report type / document format
+2. For each rule, count how many samples it would apply to (per the rule's `applicability` field, scope filter, or whatever shape your catalog uses)
+3. Flag rules that apply to **0 samples** — they're either genuinely test-corpus-irrelevant (acceptable) or over-constrained (a bug)
+
+E2E #7 GLM produced a 97-rule catalog where 36 rules (37%) had `PASS=0 FAIL=0 NOT_APPLICABLE=90` across all 90 documents — they never fired. Some were legitimate (rules for cash-management products with no cash-management samples in the corpus), but 36 inactive rules out of 97 was high enough to suggest scope-too-narrow drift.
+
+If many rules are 0-sample, either:
+- **Reframe their applicability** — broaden product types, look for evidence in headers/footers rather than just the body, relax the scope filter
+- **Document them as "future scope"** and remove them from this iteration's catalog (still capture them in a `rules/future_scope.md` so they're not forgotten)
+- **Update the test corpus** to include matching samples (work with the developer user)
+
+Catching this in `rule_extraction` is much cheaper than authoring 36 skills that then test as inactive in `skill_testing`. The cheap projection here is worth the time it saves later.

## When Rules Change

Regulations evolve. When the developer user adds new or updated regulation documents:
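A minimal sketch of that projection, assuming each catalog entry carries an `applicability` list of product types; the catalog path, field names, and the `classify_sample` stand-in are illustrative assumptions, not the package's actual schema:

```python
# project_applicability.py: hypothetical sketch of the 0-sample check
import json
from pathlib import Path

def classify_sample(path: Path) -> str:
    """Illustrative stand-in: derive a product type from the filename."""
    return "cash_management" if "cash" in path.name.lower() else "general"

def project(catalog_path: str = "rules/catalog.json", samples_dir: str = "samples") -> None:
    rules = json.loads(Path(catalog_path).read_text(encoding="utf-8"))["rules"]
    sample_types = [classify_sample(p) for p in Path(samples_dir).glob("**/*.md")]
    for rule in rules:
        # Count samples the rule's applicability filter would admit
        hits = sum(1 for t in sample_types if t in rule.get("applicability", []))
        if hits == 0:
            print(f"0-sample rule {rule['id']}: over-constrained or future scope?")

if __name__ == "__main__":
    project()
```

Run it once after extraction; any printed rule goes into the reframe / future-scope / corpus-update triage above.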
@@ -45,6 +45,32 @@ If yes, design a worker LLM prompt. Use the smallest model tier that maintains a

### The hybrid approach (most common)
Most rules are a mix: regex extracts the number, Python compares it to the threshold, and an LLM handles the exceptional cases. Design the workflow as a pipeline where cheap steps run first and expensive steps run only when needed.

+### When regex alone isn't enough — a decision rubric
+
+Before declaring distillation complete, audit each rule's `verification_type` / `metric` / `evidence_type` (or the equivalent fields in your catalog). For rules where the required verification is one of:
+
+- **Semantic** ("is this a positive guarantee or a disclaimer?")
+- **Contextual** ("interpret this in light of the document's product type")
+- **Counterfactual** ("what should this value be, given the other fields?")
+- **Cross-field arithmetic** ("does 期初 + 收益 - 分配 = 期末?")
+
+regex alone rarely suffices. Three acceptable forms (a sketch of form 2 follows below this hunk):
+
+1. **Pure regex with documented limits** — write the regex check and include a comment explaining the fragility (e.g., "matches the syntactic pattern only; cannot detect semantic guarantees")
+2. **Hybrid regex + LLM** — a regex baseline catches the obvious cases; `worker_llm_call` (tier1-2) handles the ambiguous ones. The hybrid workflow declares which rule_ids escalate.
+3. **Pure LLM via `worker_llm_call`** — for fully semantic rules where no regex baseline is meaningful.
+
+Don't ship pure regex for a rule whose `verification_type` is `judgment` / `semantic` without the documented-limits note. Future-you or a colleague will assume the regex is sufficient, and that bug will hide for months.
+
+### Cost-aware worker LLM tier choice
+
+If you do escalate to an LLM:
+- **tier1** (most capable, ~¥0.001-0.002/doc): cross-field reasoning, ambiguity resolution, rules that benefit from chain-of-thought
+- **tier2-3**: bulk extraction with simple semantic checks
+- **tier4** (cheapest): high-volume keyword spotting that regex can't handle. Note: tier4 models on SiliconFlow are Qwen3.5 thinking-mode — `content` can come back empty if `reasoning_content` consumes max_tokens. Test with realistic prompts before relying on it. If you see empty responses, bump max_tokens to ≥8192, shorten your prompt, or fall back to tier1-2.
+
+Both v0.7.1 audit conductors (DS and GLM) defaulted to all-regex distillation and only added LLM escalation when the human user explicitly asked for "V2 with worker LLM". If your rule catalog has any rules whose verification is genuinely semantic, reach for `worker_llm_call` yourself — don't wait to be asked.

## Workflow Structure

A workflow is a Python file (or small set of files) in `workflows/`:
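A minimal sketch of form 2 above (hybrid regex + LLM), assuming a `worker_llm_call(prompt, tier)` wrapper is injected by the caller; the wrapper signature, the patterns, and the verdict shape are illustrative assumptions, not the package's actual API:

```python
# check_guarantee_language.py: hypothetical hybrid check (form 2)
import re

# Cheap baseline: phrasing that is a guarantee on its face
GUARANTEE = re.compile(r"(保证.{0,8}收益|承诺.{0,8}本金)")
# Hedged phrasing that needs semantic judgment, so it escalates
AMBIGUOUS = re.compile(r"(预期收益|业绩比较基准)")

def check(doc_text: str, worker_llm_call) -> dict:
    m = GUARANTEE.search(doc_text)
    if m:
        return {"pass": False, "method": "regex", "evidence": m.group(0)}
    if AMBIGUOUS.search(doc_text):
        # Expensive step runs only when the cheap step is inconclusive
        verdict = worker_llm_call(
            prompt=("Is this passage a guarantee of returns or a disclaimer? "
                    "Answer PASS or FAIL.\n" + doc_text[:2000]),
            tier="tier2",
        )
        return {"pass": verdict.strip().upper().startswith("PASS"), "method": "hybrid"}
    return {"pass": True, "method": "regex"}
```

The escalation list (which rule_ids may call the worker) should be declared in the workflow, per form 2's contract.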
@@ -85,7 +85,7 @@ KC's main agent is the conductor. The conductor decides what to do next — and this

- one rule's decision logic is a substring or near-variant of another's
- one failure usually implies several rules failing together (R013 cannot pass while R015 fails)

-Example: R013 / R015 / R017 all check whether the table on page 3 of the report contains certain required fields. Same chunk, same parse, same verdict shape. Merge into `check_r013_r015_r017.py
+Example: R013 / R015 / R017 all check whether the table on page 3 of the report contains certain required fields. Same chunk, same parse, same verdict shape. Merge into `check_r013_r015_r017.py`, and create one task: `TaskCreate({id: "R013-R015-R017-skill_authoring", title: "R013/R015/R017 — required-field table", phase: "skill_authoring"})`. When the engine derives milestones from the filesystem, it recognizes this merged check.py and counts coverage for all three rule_ids.

### When to keep checks separate
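For illustration, a minimal sketch of what such a merged check could look like, returning one verdict per rule_id; the required-field lists and verdict shape are assumptions, not the package's actual code:

```python
# check_r013_r015_r017.py: hypothetical sketch of a merged check
# (field lists are illustrative; real rules would cite the regulation)
REQUIRED_FIELDS = {
    "R013": ["产品名称", "产品代码"],
    "R015": ["报告期", "净值日期"],
    "R017": ["托管人", "管理人"],
}

def check(doc_text: str, meta: dict = None) -> dict:
    # One pass over the shared chunk, one verdict per rule_id
    results = {}
    for rule_id, fields in REQUIRED_FIELDS.items():
        missing = [f for f in fields if f not in doc_text]
        results[rule_id] = {"pass": not missing, "missing": missing}
    return results
```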
@@ -144,6 +144,39 @@ E2E #6 v070 exposed this anti-pattern (DS wrote every bundled skill's check.py

all as `{"pass": null, "method": "stub"}`, pushing the real work out to workflows/).
v0.7.1 spelled this anti-pattern out in the skill.

+E2E #7 v071 showed the anti-stub guidance working on both conductors (neither run contained the `{"pass": null}` stub pattern), but **DS still inverted the canonical-vs-distilled relationship**: DS wrote 6 thematically grouped skill folders, each holding only a SKILL.md (no check.py), while the real verification code sat in `workflows/<skill>/check.py`. No stubs is good; an inverted relationship is not — changing one rule's logic means editing both the SKILL.md (docs) and the workflow check.py (code), and the single source of truth is gone.
+
+GLM v071, by contrast, landed the canonical pattern: 97/97 skills have both a SKILL.md and a real `check.py` (regex plus applicability logic, median 143 lines), while `workflows/<id>/workflow_v1.py` is a ~50-line thin shell that just imports and calls the skill's check.py:
+
+```python
+# workflows/D01-01/workflow_v1.py — thin shell, 52 lines
+import importlib.util, json
+from pathlib import Path
+
+def run(doc_text: str, meta: dict = None) -> dict:
+    check_path = Path(__file__).parent.parent.parent / "rule_skills" / "D01-01" / "check.py"
+    spec = importlib.util.spec_from_file_location("check", check_path)
+    mod = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(mod)
+    result = mod.check(doc_text, meta)
+    result["_workflow"] = "D01-01_v1"
+    return result
+```
+
+This is the canonical pattern for v0.7.2+: the workflow is a shell pointing at the skill's check.py. To iterate on a rule's verification logic, edit `rule_skills/<id>/check.py`; the workflow stays untouched. v0.7.2 states the guidance more explicitly: no stubs, and keep the canonical relationship (the skill is canonical; the workflow is a distilled thin shell).

### Naming convention for merged checks

When a merge is genuinely needed, the filename must state the scope:
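For contrast with the thin shell quoted above, a minimal sketch of the canonical side, `rule_skills/D01-01/check.py`; the rule logic, field names, and verdict shape are illustrative assumptions, not GLM's actual code:

```python
# rule_skills/D01-01/check.py: hypothetical sketch of the canonical side
import re

PATTERN = re.compile(r"净值日期[::]\s*\d{4}-\d{2}-\d{2}")

def check(doc_text: str, meta: dict = None) -> dict:
    # Applicability gate first, so inapplicable docs count as NOT_APPLICABLE
    if meta and meta.get("report_type") != "net_value":
        return {"status": "NOT_APPLICABLE", "rule_id": "D01-01"}
    m = PATTERN.search(doc_text)
    return {
        "pass": bool(m),
        "status": "PASS" if m else "FAIL",
        "rule_id": "D01-01",
        "evidence": m.group(0) if m else None,
    }
```

All iteration happens in this file; the workflow shell above never needs to change.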
@@ -304,18 +337,50 @@ Keep PATTERNS.md to roughly 5 KB total. When it exceeds that, trim the least actionable

5. **Pick the first task**. Do it completely (skill + check + at least one local test). Write what you learned into PATTERNS.md. Move to the next task.
6. **Around task 5 and task 10**: stop and reread PATTERNS.md. If the newly accumulated patterns suggest refactoring earlier work, **do it now** (cheap) rather than later (expensive).

-###
+### Calling TaskCreate / TaskUpdate / TaskComplete
+
+The engine registers three task-board tools (v0.7.3+):
+
+- `TaskCreate({id, title, phase, ruleId?})` — adds a task to `tasks.json`. `id` must be unique within the session; per-rule tasks should use a stable shape like `<rule_id>-<phase>`, and grouped / non-rule tasks `<group-name>-<phase>`. `phase` is the phase the task belongs to (the current phase, or a future phase you have queued up). `ruleId` is optional — when set, the engine can count that rule_id toward coverage during milestone derivation.
+- `TaskUpdate({id, status?, summary?})` — sets the task's status to `pending` / `in_progress` / `completed` / `failed`, with an optional one-line summary.
+- `TaskComplete({id, summary?})` — sugar for `TaskUpdate({id, status:"completed", summary})`. The most common path after finishing a unit of work.
+
+Call `TaskCreate` to write your decomposition onto the board; once your turn ends, the Ralph loop picks the next pending task and executes it. Finish the work, call `TaskComplete`, and the loop advances. If a task cannot be completed (an unrecoverable error), call `TaskUpdate({id, status:"failed", summary:"reason"})` so the queue keeps moving instead of getting stuck.
+
+Example:
+
+```
+TaskCreate({ id: "R001-skill_authoring", title: "Author the skill for R001",
+             phase: "skill_authoring", ruleId: "R001" })
+
+TaskCreate({ id: "trust-bundle-skill_authoring",
+             title: "R013/R015/R017 — required-field table",
+             phase: "skill_authoring" })
+
+TaskComplete({ id: "R001-skill_authoring",
+               summary: "regex check passes on 89/90; R001 done" })
+```
+
+### Persisting methodology — PATTERNS.md, phase logs, or AGENT.md decisions
+
+Principle: before every phase transition, write framework-level decisions to disk. Conversations get compacted, the agent restarts, and the next phase loses the context. Whichever format you choose, **write it to disk** — do not depend on conversation context that will disappear.
+
+All three formats hold up; pick one and stick with it:
+
+- **`rules/PATTERNS.md`** — concise, framework-level content only, updated as the project advances. Suits greenfield projects where assumptions can be stated up front and the structure is clear. Cap it at ~5 KB; entries are transferable shapes / project-level constraints / anti-patterns with reasons (see the "what to write" section above).

-
+- **Per-phase `logs/phase_<name>_complete.md`** — incremental; records what each phase produced, which decisions were made, and what the next phase inherits. Suits iterative work where the shape firms up as you discover it. E2E #7 GLM used this pattern: 6 phase documents plus `evolution_summary_v1.2.md` — the methodology was captured all the same, just without a PATTERNS.md.

-
+- **An `AGENT.md` decisions section plus domain notes** — narrative style, a living document of "what we know" and "why". Suits projects that need rich domain context captured (regulations, edge cases, thresholds, sample-format distributions). E2E #7 GLM's AGENT.md held regulation effective dates, a product-type taxonomy, threshold values, and sample-format counts — entirely fine, a different idiom for the same goal.

-
+What not to do: skip persistence and live off conversation context alone. By the time you reach skill N without the methodology on disk, you have already made N implicit decisions about verdict shapes, chunker boundaries, and worker tiers — every rule derived from scratch, and a refactor touches N files instead of one.

-
+❌ "I'll come back and write these insights down when I have time."

-
+✅ "Before every phase transition, write what this phase taught you into whichever persistence file fits this project's idiom — even if it's only a first draft."

-E2E
+E2E history:
+- E2E #6 v070: DS wrote PATTERNS.md only after the user intervened and rolled back. Until then, each skill's design decisions had solidified separately and all had to be revisited afterward. v0.7.1 added the "PATTERNS.md FIRST" guidance.
+- E2E #7 v071: neither DS nor GLM wrote a PATTERNS.md, but GLM wrote 6 phase-completion logs and a richly detailed AGENT.md — the methodology *was* captured, just in different files. v0.7.2 writes the broader principle into the skill: persist before advancing; the format is flexible.

-Milestone derivation from the filesystem (v0.7.0 Group A) verifies coverage against what is on disk, however you slice the work. TaskBoard
+Milestone derivation from the filesystem (v0.7.0 Group A) verifies coverage against what is on disk, however you slice the work. The TaskBoard is your scratchpad; the disk is the contract; the persistence files are the project's memory.