kc-beta 0.7.2 → 0.7.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +21 -8
- package/bin/kc-beta.js +20 -6
- package/package.json +1 -1
- package/src/agent/engine.js +138 -55
- package/src/agent/pipelines/_milestone-derive.js +140 -4
- package/src/agent/pipelines/initializer.js +4 -1
- package/src/agent/skill-loader.js +433 -111
- package/src/agent/tools/consult-skill.js +112 -0
- package/src/agent/tools/copy-to-workspace.js +18 -12
- package/src/agent/tools/release.js +128 -1
- package/src/agent/tools/sandbox-exec.js +4 -1
- package/src/agent/tools/task-board.js +194 -0
- package/src/agent/tools/workspace-file.js +57 -43
- package/src/config.js +6 -4
- package/template/AGENT.md +182 -7
- package/template/skills/en/{meta-meta/auto-model-selection → auto-model-selection}/SKILL.md +1 -0
- package/template/skills/en/{meta-meta/bootstrap-workspace → bootstrap-workspace}/SKILL.md +1 -0
- package/template/skills/{zh/meta → en}/compliance-judgment/SKILL.md +1 -0
- package/template/skills/en/{meta/confidence-system → confidence-system}/SKILL.md +1 -0
- package/template/skills/en/{meta/corner-case-management → corner-case-management}/SKILL.md +1 -0
- package/template/skills/en/{meta/cross-document-verification → cross-document-verification}/SKILL.md +1 -0
- package/template/skills/en/{meta-meta/dashboard-reporting → dashboard-reporting}/SKILL.md +1 -0
- package/template/skills/en/{meta/data-sensibility → data-sensibility}/SKILL.md +1 -0
- package/template/skills/{zh/meta → en}/document-chunking/SKILL.md +1 -0
- package/template/skills/en/{meta/document-parsing → document-parsing}/SKILL.md +1 -0
- package/template/skills/{zh/meta → en}/entity-extraction/SKILL.md +1 -0
- package/template/skills/en/{meta-meta/evolution-loop → evolution-loop}/SKILL.md +1 -0
- package/template/skills/en/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/SKILL.md +1 -0
- package/template/skills/en/{meta-meta/quality-control → quality-control}/SKILL.md +1 -0
- package/template/skills/en/{meta-meta/rule-extraction → rule-extraction}/SKILL.md +60 -0
- package/template/skills/en/{meta-meta/rule-graph → rule-graph}/SKILL.md +1 -0
- package/template/skills/en/{meta-meta/skill-authoring → skill-authoring}/SKILL.md +1 -0
- package/template/skills/en/skill-creator/SKILL.md +2 -1
- package/template/skills/en/{meta-meta/skill-to-workflow → skill-to-workflow}/SKILL.md +5 -4
- package/template/skills/en/{meta-meta/task-decomposition → task-decomposition}/SKILL.md +1 -0
- package/template/skills/en/{meta/tree-processing → tree-processing}/SKILL.md +1 -0
- package/template/skills/en/{meta-meta/version-control → version-control}/SKILL.md +1 -0
- package/template/skills/en/{meta-meta/work-decomposition → work-decomposition}/SKILL.md +37 -2
- package/template/skills/phase_skills.yaml +107 -0
- package/template/skills/zh/{meta-meta/auto-model-selection → auto-model-selection}/SKILL.md +1 -0
- package/template/skills/zh/{meta-meta/bootstrap-workspace → bootstrap-workspace}/SKILL.md +1 -0
- package/template/skills/{en/meta → zh}/compliance-judgment/SKILL.md +1 -0
- package/template/skills/zh/{meta/confidence-system → confidence-system}/SKILL.md +1 -0
- package/template/skills/zh/{meta/corner-case-management → corner-case-management}/SKILL.md +1 -0
- package/template/skills/zh/{meta/cross-document-verification → cross-document-verification}/SKILL.md +1 -0
- package/template/skills/zh/{meta-meta/dashboard-reporting → dashboard-reporting}/SKILL.md +1 -0
- package/template/skills/zh/{meta/data-sensibility → data-sensibility}/SKILL.md +1 -0
- package/template/skills/{en/meta → zh}/document-chunking/SKILL.md +1 -0
- package/template/skills/zh/{meta/document-parsing → document-parsing}/SKILL.md +1 -0
- package/template/skills/{en/meta → zh}/entity-extraction/SKILL.md +1 -0
- package/template/skills/zh/{meta-meta/evolution-loop → evolution-loop}/SKILL.md +1 -0
- package/template/skills/zh/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/SKILL.md +1 -0
- package/template/skills/zh/{meta-meta/quality-control → quality-control}/SKILL.md +1 -0
- package/template/skills/zh/{meta-meta/rule-extraction → rule-extraction}/SKILL.md +48 -0
- package/template/skills/zh/{meta-meta/rule-graph → rule-graph}/SKILL.md +1 -0
- package/template/skills/zh/{meta-meta/skill-authoring → skill-authoring}/SKILL.md +1 -0
- package/template/skills/zh/skill-creator/SKILL.md +2 -1
- package/template/skills/zh/skill-to-workflow/SKILL.md +190 -0
- package/template/skills/zh/{meta-meta/task-decomposition → task-decomposition}/SKILL.md +1 -0
- package/template/skills/zh/{meta/tree-processing → tree-processing}/SKILL.md +1 -0
- package/template/skills/zh/{meta-meta/version-control → version-control}/SKILL.md +1 -0
- package/template/skills/zh/{meta-meta/work-decomposition → work-decomposition}/SKILL.md +37 -2
- package/template/CLAUDE.md +0 -137
- package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +0 -188
- /package/template/skills/en/{meta/compliance-judgment → compliance-judgment}/references/output-format.md +0 -0
- /package/template/skills/en/{meta/cross-document-verification → cross-document-verification}/references/contradiction-taxonomy.md +0 -0
- /package/template/skills/en/{meta-meta/dashboard-reporting → dashboard-reporting}/scripts/generate_dashboard.py +0 -0
- /package/template/skills/en/{meta/document-parsing → document-parsing}/references/parser-catalog.md +0 -0
- /package/template/skills/en/{meta-meta/evolution-loop → evolution-loop}/references/convergence-guide.md +0 -0
- /package/template/skills/en/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/scripts/generate_review.js +0 -0
- /package/template/skills/en/{meta-meta/quality-control → quality-control}/references/qa-layers.md +0 -0
- /package/template/skills/en/{meta-meta/quality-control → quality-control}/references/sampling-strategies.md +0 -0
- /package/template/skills/en/{meta-meta/rule-extraction → rule-extraction}/references/chunking-strategies.md +0 -0
- /package/template/skills/en/{meta-meta/skill-authoring → skill-authoring}/references/skill-format-spec.md +0 -0
- /package/template/skills/en/{meta-meta/skill-to-workflow → skill-to-workflow}/references/worker-llm-catalog.md +0 -0
- /package/template/skills/en/{meta-meta/task-decomposition → task-decomposition}/references/decision-matrix.md +0 -0
- /package/template/skills/en/{meta-meta/version-control → version-control}/references/trace-id-spec.md +0 -0
- /package/template/skills/zh/{meta/compliance-judgment → compliance-judgment}/references/output-format.md +0 -0
- /package/template/skills/zh/{meta/cross-document-verification → cross-document-verification}/references/contradiction-taxonomy.md +0 -0
- /package/template/skills/zh/{meta-meta/dashboard-reporting → dashboard-reporting}/scripts/generate_dashboard.py +0 -0
- /package/template/skills/zh/{meta/document-parsing → document-parsing}/references/parser-catalog.md +0 -0
- /package/template/skills/zh/{meta-meta/evolution-loop → evolution-loop}/references/convergence-guide.md +0 -0
- /package/template/skills/zh/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/scripts/generate_review.js +0 -0
- /package/template/skills/zh/{meta-meta/quality-control → quality-control}/references/qa-layers.md +0 -0
- /package/template/skills/zh/{meta-meta/quality-control → quality-control}/references/sampling-strategies.md +0 -0
- /package/template/skills/zh/{meta-meta/rule-extraction → rule-extraction}/references/chunking-strategies.md +0 -0
- /package/template/skills/zh/{meta-meta/skill-authoring → skill-authoring}/references/skill-format-spec.md +0 -0
- /package/template/skills/zh/{meta-meta/skill-to-workflow → skill-to-workflow}/references/worker-llm-catalog.md +0 -0
- /package/template/skills/zh/{meta-meta/task-decomposition → task-decomposition}/references/decision-matrix.md +0 -0
- /package/template/skills/zh/{meta-meta/version-control → version-control}/references/trace-id-spec.md +0 -0
package/template/skills/phase_skills.yaml
ADDED
@@ -0,0 +1,107 @@
+# Phase × skills registry — single source of truth for KC's skill scoping.
+#
+# v0.7.5: edit this file once; SkillLoader propagates to system-prompt
+# injection (always_loaded bodies inline), workspace skills/ population
+# (available set symlinked into <workspace>/skills/), and audit-script
+# comparison.
+#
+# Schema:
+# phases:
+#   <phase_name>:
+#     always_loaded: [<skill_name>, ...]  # bodies injected into system prompt
+#     available: [<skill_name>, ...]      # consultable via consult_skill tool
+#
+# Always-loaded skills are auto-added to `available` at load time
+# (always_loaded ⊆ available conceptually). The list in `available`
+# below excludes already-always-loaded entries for readability.
+#
+# When adjusting: skill names must match a directory under
+# template/skills/{lang}/<name>/ containing a SKILL.md.
+
+phases:
+  bootstrap:
+    always_loaded:
+      - bootstrap-workspace
+    available:
+      - auto-model-selection
+      - data-sensibility
+      - document-parsing
+      - document-chunking
+      - version-control
+
+  rule_extraction:
+    always_loaded:
+      - rule-extraction
+    available:
+      - work-decomposition
+      - rule-graph
+      - data-sensibility
+      - document-parsing
+      - document-chunking
+      - version-control
+
+  skill_authoring:
+    always_loaded:
+      - skill-authoring
+      - work-decomposition
+    available:
+      - data-sensibility
+      - entity-extraction
+      - tree-processing
+      - compliance-judgment
+      - rule-graph
+      - corner-case-management
+      - evolution-loop
+      - skill-to-workflow
+      - skill-creator
+      - version-control
+
+  skill_testing:
+    always_loaded:
+      - evolution-loop
+    available:
+      - skill-authoring
+      - skill-to-workflow
+      - tree-processing
+      - corner-case-management
+      - compliance-judgment
+      - data-sensibility
+      - rule-graph
+      - version-control
+
+  distillation:
+    always_loaded:
+      - skill-to-workflow
+      - evolution-loop
+    available:
+      - skill-authoring
+      - task-decomposition
+      - corner-case-management
+      - confidence-system
+      - entity-extraction
+      - compliance-judgment
+      - version-control
+
+  production_qc:
+    always_loaded:
+      - quality-control
+      - evolution-loop
+    available:
+      - skill-authoring
+      - skill-to-workflow
+      - confidence-system
+      - cross-document-verification
+      - corner-case-management
+      - compliance-judgment
+      - dashboard-reporting
+      - version-control
+
+  finalization:
+    always_loaded:
+      - quality-control
+    available:
+      - skill-authoring
+      - skill-to-workflow
+      - dashboard-reporting
+      - version-control
+      - pdf-review-dashboard
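The "always_loaded ⊆ available" rule from the header comments can be sketched in a few lines. The phase data below mirrors the bootstrap block; the function name `effective_skill_sets` is a hypothetical illustration, not kc-beta's actual SkillLoader code.

```python
# Hypothetical sketch of the registry semantics: always-loaded skills are
# auto-added to the available (consultable) set at load time.
def effective_skill_sets(phase: dict) -> tuple[list[str], list[str]]:
    always = list(phase.get("always_loaded", []))
    available = list(always)  # always_loaded ⊆ available, by construction
    for name in phase.get("available", []):
        if name not in available:
            available.append(name)
    return always, available

bootstrap = {
    "always_loaded": ["bootstrap-workspace"],
    "available": ["auto-model-selection", "data-sensibility",
                  "document-parsing", "document-chunking", "version-control"],
}
always, available = effective_skill_sets(bootstrap)
print(available[0])    # bootstrap-workspace
print(len(available))  # 6
```

This also shows why the `available` lists in the file can omit the always-loaded names: they are re-added automatically.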
@@ -1,5 +1,6 @@
 ---
 name: bootstrap-workspace
+tier: meta-meta
 description: Initialize and configure a document verification workspace. Use when a developer user first opens this workspace, when .env needs configuration, or when the business scenario needs to be understood. Guides the coding agent through reading regulation documents, understanding the developer user's business context, configuring model tiers and thresholds, and establishing the working relationship. Covers initial conversation with developer user to scope the verification task, set expectations, and agree on checkpoints.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: compliance-judgment
+tier: meta
 description: Determine whether extracted entities comply with verification rules. Use after entity extraction to make the pass/fail judgment for each rule on each document. Covers translating natural language rules into executable logic, choosing between Python calculation and LLM semantic judgment, and producing actionable comments on failures. Also use when designing the judgment step of a workflow or when a rule's judgment logic needs debugging.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: confidence-system
+tier: meta
 description: Design and calibrate confidence scoring for extraction and verification results. Use when building any workflow that needs to quantify trust in its output, when setting up quality control sampling thresholds, or when calibrating existing confidence scores against actual accuracy. Confidence is the bridge between workflows and quality control. Also use when the quality control skill reports that confidence scores do not correlate with actual correctness.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: corner-case-management
+tier: meta
 description: Identify, catalog, and handle corner cases that do not fit the mainstream verification workflow. Use when the evolution loop classifies a failure as a corner case (affecting less than ~10% of documents), when adding a new edge case to the registry, or when deciding whether a corner case should be promoted to a systemic fix. Also use when designing the corner case detection mechanism for a workflow.
 ---
 
package/template/skills/zh/{meta/cross-document-verification → cross-document-verification}/SKILL.md
RENAMED
@@ -1,5 +1,6 @@
 ---
 name: cross-document-verification
+tier: meta
 description: Perform case-level analysis across multiple documents for the same transaction. Use when documents do not exist in isolation — main contracts have appendices, loan applications come bundled with income certificates, bank statements, credit reports, and property appraisals. Use to build comparison matrices, detect contradictions (hard mismatches and soft implausibilities), classify severity, and flag fraud signals. Also use when user or end-user reports a cross-document inconsistency — these reports are ground truth and take priority over agent judgment.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: dashboard-reporting
+tier: meta-meta
 description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants to see the system's status, or at any point where visual reporting would help communicate progress. Dashboards should be self-contained HTML files that can be opened by double-clicking. Also use when the developer user asks about results, accuracy, or system health.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: data-sensibility
+tier: meta
 description: Build intuition about document data before writing extraction logic. Use before designing any extraction schema or regex pattern, when onboarding a new document type, or when extraction accuracy is unexpectedly low and you suspect a data assumption is wrong. Covers systematic observation of raw documents, spot-checking extracted results, distribution analysis, and recognizing suspicious patterns. If you are about to write code that touches document data and you have not read at least five documents end-to-end, stop and use this skill first.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: document-parsing
+tier: meta
 description: Parse source documents into machine-readable text with maximum fidelity. Use when processing any document in Samples/ or Input/ for the first time, when parsed text quality is poor, or when tables and charts need special handling. Covers multi-level parser selection from simple text extraction to OCR and vision models. Also use when a verification rule fails due to parsing issues (garbled text, missing tables, mangled layouts) and the parser needs to be upgraded for that document type.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: entity-extraction
+tier: meta
 description: Extract specific entities, values, and text segments from documents as required by verification rules. Use after tree processing has located the relevant section, when a rule needs a specific number, date, name, amount, clause, or any domain-specific entity extracted. Covers extraction method selection (regex vs LLM), schema design, postprocessing, and confidence annotation. Also use when designing the extraction step of a workflow for worker LLMs.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: evolution-loop
+tier: meta-meta
 description: Drive continuous improvement of skills and workflows through the diagnose-classify-fix-retest cycle. Use after any testing round reveals failures, when production quality control flags issues, or when accuracy drops below thresholds. Covers failure analysis, distinguishing systemic issues from corner cases, deciding whether to rewrite or patch, and knowing when to stop iterating. The evolution loop is the heartbeat of the system. Also use when transitioning between lifecycle phases (skill testing, workflow testing, production monitoring).
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: quality-control
+tier: meta-meta
 description: Design and execute quality control for production verification workflows. Use when workflows are deployed on Input/ documents and results need to be monitored, when designing the QC sampling strategy for a rule, or when evaluating whether monitoring can be reduced. Covers LLM-as-Judge evaluation, adaptive sampling strategies, confidence-based triage, and the transition from active monitoring to stable oversight. Also use when production quality drops and you need to diagnose whether to trigger the evolution loop.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: rule-extraction
+tier: meta
 description: Extract and organize business verification rules from regulation documents into discrete, testable units. Use when processing documents in Rules/ to identify individual verification rules, when decomposing a regulation into atomic checks, or when the developer user adds new regulation files. Covers reading regulation text, identifying rule boundaries, determining granularity, handling cross-references, and producing a rule catalog. Also use when rules are provided in structured formats like xlsx or csv.
 ---
 
@@ -132,6 +133,53 @@ existing catalog. Therefore, when composing the brief:
 catalog.json.** rule_catalog uses workspace file locking (B9);
 sandbox_exec bypasses it and races with other writers.
 
+## How to read rule files (whole-file reads by default)
+
+Regulation files are the authoritative basis for verification. Every `source_ref` you record for a rule must be checkable against the original text. For the vast majority of rule files (a single file < 50 KB / < ~100 pages), **read the whole file in one call with `workspace_file` (operation=read)**:
+
+```js
+workspace_file({ operation: "read", scope: "project", path: "Rules/01_某某办法.md" })
+```
+
+A single `workspace_file.read` returns up to 50,000 characters, enough to cover almost any individual regulation file. This is the default behavior: **before extracting rules, read every regulation file end to end.**
+
+### Tool choice: `workspace_file` or `sandbox_exec`
+
+| Tool | Per-call limit | Suited for |
+|---|---:|---|
+| `workspace_file` (read) | 50,000 chars | **whole-file reads of regulation/rule files** |
+| `sandbox_exec` (cat/head) | 10,000 chars | short commands; not whole-file reads |
+
+`sandbox_exec` is designed for running shell commands; its 10K cap is too small for most regulations. `cat rules/01_*.md` returns only the first ~10 KB, with the rest truncated to `\n[truncated]`. Sliding a `head -N` / `tail -M` window loses line-position information and wastes interaction turns. **When you hit truncation, don't fight the limit: switch tools.**
+
+### The regulation/sample asymmetry: read regulations whole, sample the samples
+
+Regulations usually number only 1–10, carry high authority, and need to be read once. Read each one whole, as the foundation for all subsequent rule extraction and citation.
+
+Sample documents may number 30 or even 1000+, are highly heterogeneous, and will be read repeatedly during testing. **Do not try to read every sample end to end**: focus attention with rule-applicability filtering and sampled subsets.
+
+### Exception: a single regulation over 200K characters
+
+Rare in practice. The largest regulation in test_data_4 is 42 KB; typical banking regulations (资管新规, 信披办法, and the like) also stay under 50 KB. But if you do hit an oversized regulation, reading it whole would squeeze the context window (heuristic: a single file over ~200,000 characters, or over ~25% of your context budget). Then it is your call:
+
+- read it chapter by chapter (`第X章`), via `document_parse` or paginated `workspace_file`
+- or build an in-workspace index file marking each chapter's offset, and read sections on demand during extraction
+
+The 50 KB cap is already generous; the exception above almost never triggers. **The default is the whole-file read; deviate only when a file really is too large.**
+
 ## Extraction Strategies
 
 ### Strategy 1: Structured Input (Developer User Provides Rules)
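The whole-file-by-default heuristic above can be sketched as a small decision function. The thresholds (200K characters, 25% of context budget, 50K per read) come from the skill text; the function itself and the `whole-paginated` middle case are illustrative assumptions, not part of kc-beta.

```python
# Illustrative sketch of the read-strategy decision for a regulation file.
def read_strategy(file_chars: int, context_budget_chars: int) -> str:
    # Oversized files would squeeze the context window: read per chapter
    # (第X章) or via an offset index instead of whole.
    if file_chars > 200_000 or file_chars > 0.25 * context_budget_chars:
        return "chaptered"
    # Files within the 50,000-char cap fit a single workspace_file read.
    if file_chars <= 50_000:
        return "whole"
    # Still read whole, but across several sequential 50K reads.
    return "whole-paginated"

print(read_strategy(42_000, 800_000))   # whole  (largest test_data_4 regulation)
print(read_strategy(250_000, 800_000))  # chaptered
```

The point of the sketch is the ordering: the chaptered exception is checked first, and everything else defaults to reading the file whole.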
@@ -1,5 +1,6 @@
 ---
 name: rule-graph
+tier: meta-meta
 description: Build and maintain a graph of relationships between verification rules — shared entities, logical dependencies, and conflicts. Use when analyzing the impact of a regulation change, when optimizing extraction to avoid duplicate work, when checking rule catalog completeness, or when rolling up document-level results into a summary. Critical constraint — the graph is an overlay for analysis, NOT a prerequisite for execution. Every rule must remain independently runnable.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: skill-authoring
+tier: meta
 description: Write each verification rule into a Claude Code skill folder following the official skill format. Use when converting extracted rules into skill folders, when iterating on existing rule skills after testing, or when the developer user wants to capture domain knowledge as a skill. Each skill folder must be self-contained with business logic in SKILL.md, code in scripts/, regulation context in references/, and sample data in assets/. Also use the bundled skill-creator for the full eval/iterate workflow.
 ---
 
@@ -1,6 +1,7 @@
 ---
 name: skill-creator
-
+tier: meta
+description: Anthropic's official skill scaffolding tool, for iterating on/optimizing an existing skill or running an evaluation against it; not the first reference for building a KC per-rule verification skill. To write a KC rule skill, first consult `skill-authoring` (canonical directory structure + granularity rules + KC-specific check.py entry-point conventions) and `work-decomposition` (ordering and grouping decisions). This skill applies when a per-rule skill already exists and the agent wants to refine its description/triggering or run a formal evaluation.
 ---
 
 # Skill Creator
package/template/skills/zh/skill-to-workflow/SKILL.md
ADDED
@@ -0,0 +1,190 @@
+---
+name: skill-to-workflow
+tier: meta
+description: Distill a tested verification skill into a Python workflow with worker LLM prompts. Use when a rule skill has been tested and meets the SKILL_ACCURACY threshold defined in `.env`. Covers the decisions of which parts to implement as code and which as LLM calls; prompt engineering for small context windows; model tier selection and progressive downgrading; and testing the workflow against the coding agent's own skill results as ground truth. Also use for cost or speed optimization of an existing workflow.
+---
+
+# Skill to Workflow
+
+The skill is the ground truth. The workflow is a cheaper, faster approximation. Your job is to make that approximation as cheap as possible while tracking the original's accuracy.
+
+## Engineering goal
+
+Optimize the whole chain: **shortest workflow** (fewest nodes) → **smallest model per node** (the cheapest tier that still meets accuracy) → **shortest prompt per model** (fewest tokens). That is the engineering goal, not the elegance of the prompt templates or conformance to some framework.
+
+## When to start
+
+A skill is ready to be distilled into a workflow only when all of the following hold:
+
+- It has been tested on every document under Samples/.
+- Its accuracy meets or exceeds the SKILL_ACCURACY threshold in `.env`.
+- Its edge cases are recorded in the skill's `assets/corner_cases.json`.
+- You understand the rule well enough to state, word for word, how you verify it.
+
+If any of these fails, go back and keep iterating the skill; don't start distilling.
+
+## Distillation decisions
+
+For each step of the skill-based verification process, ask yourself:
+
+### Can this step be done with regex or Python? (cost: zero)
+- Date extraction in a known format → regex
+- Numeric comparison against a threshold → Python arithmetic
+- Chinese numeral conversion → Python lookup table
+- Format validation (ID numbers, codes) → regex
+- Pulling table cells from structured markdown → string processing
+
+If so, write it as code. These operations are free, fast, and deterministic.
+
+### Does this step need language understanding? (cost: one worker LLM call)
+- Locating the relevant passage in a document → LLM
+- Extracting an entity described in natural language → LLM
+- Judging semantic sufficiency ("is the disclosure adequate") → LLM
+- Resolving an ambiguous reference → LLM
+
+If so, design a worker LLM prompt. Use the smallest model tier that preserves accuracy.
+
+### Hybrid (the most common case)
+Most rules are hybrids: regex extracts the number, Python compares the threshold, and an LLM handles the few special cases. Design the workflow as a pipeline, with the cheap steps running first and the expensive steps running only when needed.
+
+### When regex is not enough: decision criteria
+
+Before declaring distillation complete, audit each rule's `verification_type` / `metric` / `evidence_type` (or the corresponding catalog fields). If a rule's verification involves any of:
+
+- **semantic** judgment
+- **contextual** interpretation
+- **counterfactual** reasoning
+- **cross-field arithmetic**
+
+then regex alone is almost certainly insufficient. Three forms are acceptable:
+
+1. **Pure regex with an explicit limitation note**: write the regex check and document its fragility in a comment (e.g. "matches syntactic patterns only; cannot detect semantic guarantees")
+2. **Regex + LLM hybrid**: a regex baseline handles the obvious cases, and `worker_llm_call` (tier1-2) handles the ambiguous ones. A hybrid workflow must state explicitly which rule_ids escalate to the LLM.
+3. **Pure LLM via `worker_llm_call`**: for fully semantic rules with no meaningful regex baseline.
+
+For rules whose `verification_type` is `judgment` / `semantic`, never ship plain regex without the explicit limitation note. Future you, or a colleague, will assume the regex is sufficient; that kind of bug can stay buried for months.
+
+### Cost-aware tier selection for worker LLMs
+
+If an LLM really is needed:
+- **tier1** (most capable, ~¥0.001-0.002/doc): cross-field reasoning, ambiguity resolution, rules that benefit from chain-of-thought
+- **tier2-3**: bulk extraction + simple semantic checks
+- **tier4** (cheapest): high-volume keyword recognition that regex cannot cover. Caution: the tier4 model on SiliconFlow is Qwen3.5 in thinking mode; if `reasoning_content` exhausts max_tokens, `content` may come back as an empty string. Test with a real prompt before depending on it. On empty responses, raise max_tokens to ≥8192, shorten the prompt, or fall back to tier1-2.
+
+In v0.7.1 both audit conductors (DS and GLM) defaulted to all-regex distillation, adding the LLM escalation path only when the user explicitly asked for "V2, with worker LLMs". If any rule in your catalog is inherently semantic, reach for `worker_llm_call` on your own initiative; don't wait to be asked.
+
+## Workflow structure
+
+A workflow is a Python file (or a few small related files) under `workflows/`:
+
+```
+workflows/
+  rule_001_capital_adequacy/
+    workflow_v1.py      # The main workflow script
+    prompts/
+      extract.txt       # Worker LLM prompt for extraction
+      judge.txt         # Worker LLM prompt for judgment (if needed)
+    config.json         # Model assignments, thresholds
+```
+
+The workflow file should have a clear entry point:
+
+```python
+def verify(document_text: str, config: dict) -> dict:
+    """
+    Returns:
+      {
+        "rule_id": "R001",
+        "result": "pass" | "fail" | "missing" | "error",
+        "extracted_value": ...,
+        "confidence": 0.0-1.0,
+        "comment": "..." (only when fail),
+        "model_used": "...",
+        "llm_calls": int,
+        "llm_tokens": int
+      }
+    """
+```
+
+This is a reference, not a rigid contract. Adapt the structure to what each rule needs. What matters is that every workflow produces results that can be compared against the skill's ground truth.
+
+## Prompt engineering for worker LLMs
+
+Worker LLMs have small context windows (typically 16K-32K tokens). Design prompts to be:
+
+1. **Self-contained.** Everything the model needs goes into the prompt. Don't assume it remembers context from earlier calls.
+2. **Format-specified.** "Return a JSON object with fields: value, confidence, reasoning." Structured output reduces parsing errors.
+3. **Fed narrowed context only.** Don't feed it the whole document. Narrow the context with the tree-processing pipeline (whole document → relevant chapter → relevant section) before calling the worker LLM.
+4. **Written in the document's language.** Chinese prompts for Chinese documents, English prompts for English documents. Don't mix the two in one prompt.
+5. **Restrained with examples.** One or two examples help; ten waste the context window and invite overfitting.
+
+## Model tier selection
+
+For each step, start with the highest tier (TIER1). Measure accuracy. Then try lower tiers:
+
+1. Run the workflow with TIER1 on all of Samples/, recording per-step accuracy.
+2. For each step, try TIER2. If accuracy stays above WORKFLOW_ACCURACY, keep TIER2.
+3. Keep stepping down until accuracy falls below the threshold.
+4. Record each step's optimal tier in `config.json`.
+
+Different steps in one workflow can use different tiers. Extraction may need TIER2 while judgment is fine on TIER3.
+
+### Formal downgrade protocol
+
+The basic approach above works, but a stricter protocol avoids locking in a tier prematurely:
+
+**Direction**: top-down (TIER1 → TIER4), establishing the accuracy ceiling first. You have to know where optimal accuracy tops out before you start trading it for cost.
+
+**Minimum test sample**: before any tier decision, run each candidate tier on enough documents (e.g. `min(10, total_samples)`). Small samples are unreliable; a 3-document test can be completely misleading.
+
+**Accuracy-gap trigger**: if a lower tier's accuracy is clearly below a higher tier's (e.g. by more than 5 percentage points), keep the higher tier for that step. Within tolerance, take the cheaper tier.
+
+**Per-step independence**: evaluate each workflow step separately and write each step's optimal tier into `config.json`. Don't assume the whole workflow must share one tier.
+
+**Re-evaluation trigger**: if production QC finds a step's accuracy degrading (e.g. documents in a new format appear), rerun the tier evaluation for that step.
+
+**Model-task recommendation tables**: maintain a project-level mapping of "task type → recommended tier" based on your own test results. Over time these tables can be aggregated across projects into general tier guidance.
+
+Every number here (10 documents, 5 percentage points, and so on) is only a recommended starting point. The coding agent and the developer user should calibrate them against the actual volume, accuracy requirements, and cost constraints, or replace them outright with a different evaluation method. What matters is the pattern: **test at each tier → compare accuracy → lock in when within tolerance → re-evaluate on degradation**.
+
+This is the same tier-transition framework as parser escalation in `document-parsing`: a quality/accuracy score drives the keep / escalate / skip decision.
+
+## Testing against ground truth
+
+The coding agent's skill-based results are the ground truth. For every document under Samples/:
+
+1. Run the workflow.
+2. Compare the workflow's results with the skill's results.
+3. Record mismatches: which step failed, expected vs. actual.
+4. Compute accuracy: `(matching results) / (total documents)`.
+5. If accuracy < WORKFLOW_ACCURACY, locate and fix, using the `evolution-loop` methodology.
+
+## Versioning
+
+Each iteration is a new version file: `workflow_v1.py`, `workflow_v2.py`, and so on. Track the active version in `config.json`. See the `version-control` skill for the full methodology.
+
+## Workflow releases
+
+Once a workflow reaches its accuracy threshold, it can be packaged for end users via the `release` tool. Each release is a self-contained directory under `output/releases/<slug>/` containing the pinned workflow, a Python runner, a confidence scorer, an HTML dashboard generator, and a `serve.sh` launch script. The package has no dependency on kc-beta: anyone with Python and a worker LLM API key can run `python run.py <doc>` and get verification results.
+
+What to package is your call: every rule in the catalog, or a subset picked with the `include` parameter; and whether to bundle 1-3 representative samples into `fixtures/` so the recipient can dry-run the release without data of their own.
+
+The `release` tool first takes a git snapshot of the workspace (tagged `snap/release-<slug>`), so even if `output/releases/` is later cleaned up, the whole package can be regenerated from git. When to release is also your call; there is no automation and no mandated cadence. Common triggers: a workflow reaches the SKILL/WORKFLOW_ACCURACY threshold; a stakeholder needs a handoff; a production cron should run a pinned version rather than the latest. Decide together with the developer user.
+
+## Cost tracking
+
+Track the cost of every workflow run:
+- LLM calls per document.
+- Total tokens consumed per document.
+- Model tier used for each call.
+
+This data helps the developer user understand production costs and informs later optimization.
+
+## Worker LLM API
+
+Worker LLMs are reached through the SiliconFlow API. Connection details live in `.env`:
+- `SILICONFLOW_API_KEY` for authentication
+- `SILICONFLOW_BASE_URL` for the API endpoint
+- `TIER1` through `TIER4` for each tier's model name
+
+For each model's current capabilities and context window sizes, see `references/worker-llm-catalog.md`.
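The ground-truth comparison loop in the skill above reduces to a few lines. A minimal sketch, assuming both result sets are maps from document name to the `result` field of the `verify()` dict; the function name `workflow_accuracy` and the data shapes are illustrative, not kc-beta's actual code.

```python
# Compare workflow results against skill (ground-truth) results and
# compute accuracy = matching results / total documents.
def workflow_accuracy(skill_results: dict[str, str],
                      workflow_results: dict[str, str]) -> tuple[float, list[str]]:
    mismatches = [doc for doc, truth in skill_results.items()
                  if workflow_results.get(doc) != truth]
    accuracy = 1 - len(mismatches) / len(skill_results)
    return accuracy, mismatches

truth = {"doc1.pdf": "pass", "doc2.pdf": "fail",
         "doc3.pdf": "pass", "doc4.pdf": "pass"}
wf    = {"doc1.pdf": "pass", "doc2.pdf": "pass",
         "doc3.pdf": "pass", "doc4.pdf": "pass"}

acc, diffs = workflow_accuracy(truth, wf)
print(acc)    # 0.75
print(diffs)  # ['doc2.pdf']
```

With a WORKFLOW_ACCURACY threshold of, say, 0.9, the 0.75 here would send `doc2.pdf` into the evolution-loop diagnosis rather than into release.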
@@ -1,5 +1,6 @@
 ---
 name: task-decomposition
+tier: meta-meta
 description: Decompose each verification rule into independent sub-tasks and assign the optimal method (rule, code, LLM, manual) to each. Use when converting extracted rules into implementation plans, when a rule skill is too expensive or inaccurate and needs restructuring, or when designing a multi-step verification pipeline. Covers MECE decomposition, method selection via the four-dimension decision matrix, cost-benefit analysis, and source tagging. Also use when auditing an existing workflow for cost optimization opportunities.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: version-control
+tier: meta
 description: Manage versioning of skills, workflows, prompts, and system configuration throughout the lifecycle. Use when skills are modified, workflows are regenerated, prompts are updated, or any artifact needs rollback capability. Covers what to version, how to version with file-system conventions, maintaining a version manifest, and rollback procedures. Also use when comparing performance between versions or when production results need to trace back to the exact workflow version that produced them.
 ---
 
@@ -1,5 +1,6 @@
 ---
 name: work-decomposition
+tier: meta-meta
 description: Decide how to split the rule set into TaskBoard tasks during the rule_extraction → skill_authoring transition. Covers ordering methods (difficulty-first / Shannon–Huffman, breadth-first, depth-first, binary splitting), grouping strategy (criteria for merging multiple rules into one task vs. keeping each separate), three-axis difficulty assessment, and how to write a PATTERNS.md project memory that stays useful across the whole pipeline. Use when entering rule_extraction, entering skill_authoring, or when the TaskBoard feels off track and needs re-splitting.
 ---

@@ -7,7 +8,7 @@ description: Decide how to split the rule set into TaskBoard tasks during the
 
 KC's main agent is the conductor. The conductor decides what to do next, and that decision dominates every choice that follows. A bad split makes the whole session expensive: get the rule order wrong and the agent redesigns the same structure three times; merge unrelated rules into one skill and the resulting check.py becomes the "unified executor" anti-pattern from E2E #4; scatter related rules that should have been merged and the agent re-derives the same chunker logic 17 times.
 
-This skill
+This skill is the conductor's playbook for decisions like these. It carries `tier: meta-meta` because work decomposition is a system-level discipline, not a technique specific to any one rule. The complementary `task-decomposition` skill (also `tier: meta-meta`) covers the structure **inside** a single rule: locate → extract → normalize → judge → comment. This skill covers how the **set** of rules gets cut into TaskBoard tasks.
 
 ## When to use this skill
 
@@ -85,7 +86,7 @@ KC's main agent is the conductor. The conductor decides what to do next
 - One rule's judgment logic is a substring or near-variant of another's
 - One failure usually implies several rules fail together (R013 cannot pass while R015 fails)
 
-Example: R013 / R015 / R017 all check whether the table on page 3 of the report contains certain required fields. Same chunk, same parse, same verdict shape. Merge into `check_r013_r015_r017.py
+Example: R013 / R015 / R017 all check whether the table on page 3 of the report contains certain required fields. Same chunk, same parse, same verdict shape. Merge into `check_r013_r015_r017.py`, and create a single task: `TaskCreate({id: "R013-R015-R017-skill_authoring", title: "R013/R015/R017 required-field table", phase: "skill_authoring"})`. When the engine derives milestones from the file system, it recognizes the merged check.py and counts coverage for all three rule_ids.
 
 ### When to keep rules separate
 
@@ -337,6 +338,40 @@ Keep the whole of PATTERNS.md to about 5 KB. When it overflows, cut the least
 5. **Pick the first task**. Take it to done (skill + check + at least one local test). Write what you learned into PATTERNS.md. Move to the next task.
 6. **Around the 5th and 10th task**: stop and reread PATTERNS.md. If newly accumulated patterns suggest refactoring earlier work, do it **now** (cheap) rather than later (expensive).
 
+### Calling TaskCreate / TaskUpdate / TaskComplete
+
+The engine registers three task-board tools (v0.7.4):
+
+- `TaskCreate({id, title, phase, ruleId?})`: add a task to `tasks.json`. `id` must be unique within the session; for per-rule tasks prefer the stable shape `<rule_id>-<phase>`, and for grouped / non-rule tasks use `<group-name>-<phase>`. `phase` is the current phase the task belongs to. `ruleId` is optional; when set, the engine counts that rule_id toward coverage during milestone derivation.
+- `TaskUpdate({id, status?, summary?})`: set the task's status to `pending` / `in_progress` / `completed` / `failed`, optionally attaching a one-line summary.
+- `TaskComplete({id, summary?})`: sugar for `TaskUpdate({id, status:"completed", summary})`. The most common path after finishing a unit of work.
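The semantics of the three calls can be modeled against an in-memory store. This is a sketch of the contract only; the real tools persist to `tasks.json`, and this mock is ours, not engine code:

```javascript
// Illustrative mock of the task-board semantics; not the engine implementation.
const tasks = new Map();

function TaskCreate({ id, title, phase, ruleId }) {
  if (tasks.has(id)) throw new Error(`duplicate task id: ${id}`);
  tasks.set(id, { id, title, phase, ruleId, status: "pending", summary: null });
}

function TaskUpdate({ id, status, summary }) {
  const t = tasks.get(id);
  if (status) t.status = status;
  if (summary) t.summary = summary;
}

// TaskComplete is sugar for TaskUpdate with status "completed".
function TaskComplete({ id, summary }) {
  TaskUpdate({ id, status: "completed", summary });
}

TaskCreate({ id: "R001-skill_authoring", title: "Author skill for R001",
             phase: "skill_authoring", ruleId: "R001" });
TaskComplete({ id: "R001-skill_authoring", summary: "89/90 pass" });
console.log(tasks.get("R001-skill_authoring").status); // "completed"
```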
+
+### Ralph loop scope: current phase only
+
+An important contract (adjusted in v0.7.4 after team feedback):
+
+- **Loop scope = current phase only.** TaskCreate may only create tasks for the current phase, and the Ralph loop works through them one at a time within that phase.
+- **Phase boundary = loop exit.** When every task in the current phase is done, or the phase advances (you call `phase_advance`, or anything else changes `currentPhase`), the loop exits cleanly and control returns to the user.
+- **The engine no longer auto-advances phases.** Even with all tasks completed and the exit criteria met, the engine will not jump to the next phase. Advancing happens when you **explicitly call** `phase_advance`, or when the user re-prompts.
+- **Do not pre-create tasks for future phases.** They get ignored: the loop exits at the phase boundary first and never processes them. Only create tasks for the phase you are **currently in**.
+- **Phase boundary = user checkpoint.** This is deliberate. The team wants to see progress at natural break points. Finish your batch of tasks, call `phase_advance`, the loop exits, you report progress to the user in your final message, and the user prompts you into the next phase.
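The exit contract above can be stated as a predicate. Illustrative only; the names are ours, not engine internals, and whether `failed` tasks keep the loop alive is an assumption here:

```javascript
// Illustrative: the Ralph loop keeps going only while both conditions hold.
function shouldContinue(state, phaseAtLoopStart) {
  const samePhase = state.currentPhase === phaseAtLoopStart;
  const pending = state.tasks.some(
    (t) => t.phase === phaseAtLoopStart &&
           t.status !== "completed" && t.status !== "failed");
  return samePhase && pending;
}

const state = {
  currentPhase: "skill_authoring",
  tasks: [
    { id: "R001-skill_authoring", phase: "skill_authoring", status: "completed" },
    { id: "R002-skill_authoring", phase: "skill_authoring", status: "pending" },
    // A task pre-created for a future phase never keeps the loop alive:
    { id: "R001-finalization", phase: "finalization", status: "pending" },
  ],
};
console.log(shouldContinue(state, "skill_authoring")); // true: R002 still pending
state.tasks[1].status = "completed";
console.log(shouldContinue(state, "skill_authoring")); // false: phase work done
```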
+
+Unattended end-to-end runs ("from bootstrap straight through finalization without stopping") are **not the engine's job**; that capability will arrive later as an external driver (a `/loop`-style command) that repeatedly invokes the agent across phases. Within a single invocation, finish the current phase, advance, and return to the user.
+
+Example:
+
+```
+TaskCreate({ id: "R001-skill_authoring", title: "Author the skill for R001",
+             phase: "skill_authoring", ruleId: "R001" })
+
+TaskCreate({ id: "trust-bundle-skill_authoring",
+             title: "R013/R015/R017 required-field table",
+             phase: "skill_authoring" })
+
+TaskComplete({ id: "R001-skill_authoring",
+               summary: "Regex check passes on 89/90; R001 done" })
+```
+
 ### Persisting methodology: PATTERNS.md, phase logs, or AGENT.md decisions
 
 Principle: before every phase advance, write framework-level decisions to disk. Conversations get compacted, the agent restarts, and the next phase loses its context. Whichever format you choose, **write it to disk**; never rely on conversational context that will disappear.
|