kc-beta 0.8.1 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (63)
  1. package/package.json +1 -1
  2. package/src/agent/context.js +17 -1
  3. package/src/agent/engine.js +85 -8
  4. package/src/agent/llm-client.js +24 -1
  5. package/src/agent/pipelines/_milestone-derive.js +78 -7
  6. package/src/agent/pipelines/skill-authoring.js +19 -2
  7. package/src/agent/tools/release.js +94 -1
  8. package/src/cli/index.js +28 -7
  9. package/template/.env.template +1 -1
  10. package/template/AGENT.md +2 -2
  11. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  12. package/template/skills/en/bootstrap-workspace/SKILL.md +13 -0
  13. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  14. package/template/skills/en/confidence-system/SKILL.md +30 -8
  15. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  16. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  17. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  18. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  19. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  20. package/template/skills/en/document-chunking/SKILL.md +99 -15
  21. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  22. package/template/skills/en/quality-control/SKILL.md +14 -0
  23. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  24. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  25. package/template/skills/en/skill-authoring/SKILL.md +52 -8
  26. package/template/skills/en/skill-creator/SKILL.md +25 -3
  27. package/template/skills/en/skill-to-workflow/SKILL.md +23 -4
  28. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  29. package/template/skills/en/tree-processing/SKILL.md +1 -1
  30. package/template/skills/en/version-control/SKILL.md +15 -0
  31. package/template/skills/en/work-decomposition/SKILL.md +21 -35
  32. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  33. package/template/skills/zh/bootstrap-workspace/SKILL.md +13 -0
  34. package/template/skills/zh/compliance-judgment/SKILL.md +14 -0
  35. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  36. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  37. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  38. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  39. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  40. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  41. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  42. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  43. package/template/skills/zh/document-chunking/SKILL.md +96 -20
  44. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  45. package/template/skills/zh/entity-extraction/SKILL.md +14 -4
  46. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  47. package/template/skills/zh/quality-control/SKILL.md +14 -0
  48. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  49. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  50. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  51. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  52. package/template/skills/zh/skill-authoring/SKILL.md +108 -69
  53. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  54. package/template/skills/zh/skill-creator/SKILL.md +71 -61
  55. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  56. package/template/skills/zh/skill-to-workflow/SKILL.md +24 -5
  57. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  58. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  59. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  60. package/template/skills/zh/tree-processing/SKILL.md +1 -1
  61. package/template/skills/zh/version-control/SKILL.md +15 -0
  62. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  63. package/template/skills/zh/work-decomposition/SKILL.md +21 -33
@@ -58,7 +58,7 @@ Spend time here. The patterns you find determine whether the tree builder is a s
  - This is fast, deterministic, and reliable. Prefer this when it works.
 
  **If patterns are inconsistent or absent**:
- - Use the LLM-guided wedge-driving approach (see `rule-extraction/references/chunking-strategies.md` for the full algorithm: rolling context window, K-token quoting, Levenshtein fuzzy matching).
+ - Use the LLM-guided wedge-driving approach (see the `document-chunking` skill for the full algorithm: rolling context window, K-token quoting, Levenshtein fuzzy matching).
  - This is slower and costs LLM calls, but handles unstructured documents. The rolling window means even very large unstructured leaf nodes can be chunked incrementally.
 
  **If the document has a table of contents**:
@@ -164,3 +164,18 @@ Every evolution cycle (see `evolution-loop`) should:
  3. Log the version change with the evolution iteration number.
 
  The version history, combined with the evolution logs, gives you a complete timeline of how the system evolved and why.
+
+ ## Per-rule check.py — preserve v1 before v2 rewrite
+
+ When iterating a rule's verification logic from a v1 baseline (often pure regex) to a v2 implementation (often LLM-augmented or hybrid), **copy the v1 file to a sibling before overwriting**:
+
+ ```bash
+ cp rule_skills/Rxx/check.py rule_skills/Rxx/check_v1.py
+ # now write the v2 version to check.py
+ ```
+
+ Convention:
+ - `check.py` always points at the current best version
+ - `check_v1.py`, `check_v2.py`, ... preserve prior iterations
+
+ This way the v1 lives alongside v2 in the same directory rather than relying on workspace git archaeology (`git log -- check.py` works but is friction). Engine-level `verify_engine_v1.py` / `verify_engine_v2.py` preserve the orchestrator separately; per-rule files need their own convention.
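The copy-before-overwrite convention can be wrapped in a small helper. This is a sketch, not part of the package; the function name is invented, and it assumes the `rule_skills/Rxx/` layout described above:

```python
import re
import shutil
from pathlib import Path

def preserve_prior_version(rule_dir):
    """Copy check.py to the next free check_vN.py before a rewrite.

    Hypothetical helper illustrating the naming convention. Returns
    the snapshot path, or None when there is nothing to preserve yet.
    """
    rule_dir = Path(rule_dir)
    current = rule_dir / "check.py"
    if not current.exists():
        return None
    # Find the highest existing check_vN.py and go one past it.
    taken = [
        int(m.group(1))
        for p in rule_dir.glob("check_v*.py")
        if (m := re.fullmatch(r"check_v(\d+)\.py", p.name))
    ]
    snapshot = rule_dir / f"check_v{max(taken, default=0) + 1}.py"
    shutil.copy2(current, snapshot)
    return snapshot
```

Calling it before each rewrite keeps `check.py` pointing at the current best version while every prior iteration stays on disk.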
@@ -6,7 +6,7 @@ description: Decide how to decompose the rule set into TaskBoard tasks during ru
 
  # Work Decomposition
 
- KC's main agent is the conductor. The conductor decides what work to do next — and that decision is upstream of every other choice that follows. Wrong decomposition makes the rest of the run expensive: if rules are processed in the wrong order, the agent re-designs the same shape three times. If unrelated rules are bundled into one skill, the resulting check.py becomes the unified-runner anti-pattern from E2E #4. If related rules are split across separate skills, the agent re-derives the shared chunker logic 17 times.
+ KC's main agent is the conductor. The conductor decides what work to do next — and that decision is upstream of every other choice that follows. Wrong decomposition makes the rest of the run expensive: if rules are processed in the wrong order, the agent re-designs the same shape three times. If unrelated rules are bundled into one skill, the resulting check.py drifts into the unified-runner anti-pattern. If related rules are split across separate skills, the agent re-derives the shared chunker logic many times over.
 
  This skill is the conductor's playbook for that decision. It's tagged `tier: meta-meta` because work decomposition is a system-level discipline, not a per-rule technique. The complementary `task-decomposition` skill (also `tier: meta-meta`) covers the *internal* structure of one rule's check — locate, extract, normalize, judge, comment. This skill covers how the rule **set** should be split into TaskBoard items.
 
@@ -15,7 +15,7 @@ This skill is the conductor's playbook for that decision. It's tagged `tier: met
  - **Entering rule_extraction.** Read the regulation, decompose into rules, then decide how those rules will be ordered and grouped before declaring the phase done. Coverage audit + chunk refs are downstream of these decisions.
  - **Entering skill_authoring.** TaskBoard is empty (engine no longer auto-populates per-rule tasks). Read the rule list from `describeState`, decide grouping + order, then call `TaskCreate` for each unit of work.
  - **Mid-run re-decomposition.** If the TaskBoard feels wrong (rules accumulating in the wrong order, an obviously-bundled pair across two tasks), stop adding work and re-decompose. The cost of pausing 5 minutes to re-plan is recovered within 2 rules of better-shaped work.
- - **Any phase with 3+ parallel sub-goals.** If you find yourself juggling multiple parallel sub-goals in working memory (3+ rules × docs, multiple deliverable-prep items in finalization, several QC batches in production_qc), drop them into the TaskBoard and work serially. v0.7.5 audits showed distillation + production_qc benefit from explicit tasks even when the registry didn't expose this skill there; v0.8 P2-E makes the skill available in every phase.
+ - **Any phase with 3+ parallel sub-goals.** If you find yourself juggling multiple parallel sub-goals in working memory (3+ rules × docs, multiple deliverable-prep items in finalization, several QC batches in production_qc), drop them into the TaskBoard and work serially. Every phase from rule_extraction through finalization benefits from explicit tasks once parallel sub-goals appear, distillation and production_qc included.
 
  ## Quick rule: when does the TaskBoard belong?
 
@@ -40,7 +40,7 @@ Pick one explicitly and write it into your first PATTERNS.md entry. "I'm going S
 
  Process the **hardest** rule first. Use the chunker, verdict shape, and worker tier that hard rule demands as the design floor. Process subsequent rules in descending difficulty, each one a degenerate case of the machinery already built.
 
- **When to pick:** the rule set has uneven complexity and you suspect a few hard rules will dictate the shape (almost always true for compliance / regulatory work). E2E #5 GLM accidentally followed this path and produced 0.6% ERROR on real LLM-driven workflows; DS started bottom-up and shipped 78% NOT_APPLICABLE.
+ **When to pick:** the rule set has uneven complexity and you suspect a few hard rules will dictate the shape (almost always true for compliance / regulatory work). When this method is followed correctly it tends to produce sub-1% ERROR rates on real LLM-driven workflows; the bottom-up alternative typically over-produces NOT_APPLICABLE verdicts because the easy rules' machinery can't handle the hard cases at the end.
 
  **Why "Huffman" not "Shannon" for the analogy:** Huffman builds optimal prefix codes by processing low-frequency symbols first. KC's analogue is the high-cost-per-rule, low-frequency rules — the R028s that dominate the design space even though there are few of them. Touch them first. The easy rules inherit the framework cheaply.
 
@@ -105,10 +105,10 @@ Keep separate when ANY of:
  - Rules apply to different document types (one applies only to public-fund reports, another only to private-fund reports)
  - One rule's failure mode is a specific failure mode of another (don't bundle parent + child rules — the child's check redundantly re-runs the parent's)
 
- The v0.6.2 D2 anti-pattern wording captures the failure case clearly:
+ The anti-pattern wording captures the failure case clearly:
  > If you find yourself writing a unified_qc.py-style monolith that bypasses individual skills, your per-rule skills are wrong. Fix them, don't replace them.
 
- That came from E2E #4 where one conductor wrote a 2,400-line `unified_qc.py` that ran all rules at once. It produced 1,150 ERROR verdicts (16.6%) because every rule's failure cascaded into every other rule's verdict. Per-rule skills are KC's unit of granularity for a reason.
+ A failure mode worth flagging: a conductor writes a 2,000+ line `unified_qc.py` that runs all rules at once. The result is cascading errors: every rule's failure corrupts every other rule's verdict, easily producing 15%+ ERROR rates on production checks. Per-rule skills are KC's unit of granularity for a reason.
 
  ### Anti-pattern: stub check.py + real workflow.py
 
@@ -152,23 +152,13 @@ iterations of the skill (changes to regulation interpretation, edge
  cases discovered in production) need a single canonical place to
  update — the skill — not N workflows that have drifted independently.
 
- E2E #6 v070 surfaced this pattern (DS bundled-skill check.py files
- all returned `{"pass": null, "method": "stub"}` deferring to
- workflows/). v0.7.1 added this anti-pattern explicitly.
+ Two failure modes worth flagging:
 
- E2E #7 v071 showed the teaching prevented the stub anti-pattern in
- both conductors (no `{"pass": null}` patterns in either run), but
- **DS still inverted the canonical-vs-distilled relationship**: DS's
- 6 thematic skill folders had SKILL.md only (no check.py), with the
- real verification code living in `workflows/<skill>/check.py`. The
- absence of stubs is good; the inversion is not — editing a rule then
- requires touching both SKILL.md (the doc) and the workflow check.py
- (the code). Single source of truth is lost.
+ **The pure-stub failure:** bundled-skill check.py files all return `{"pass": null, "method": "stub"}` deferring to `workflows/`. Methodology is described in SKILL.md but never executable from the skill folder.
 
- GLM v071 by contrast landed the canonical pattern: 97/97 skills had
- both SKILL.md AND a real `check.py` (median 143 LOC of regex +
- applicability logic), and `workflows/<id>/workflow_v1.py` was a
- 50-line thin wrapper that imported and called it:
+ **The inverted-canonical failure:** the agent avoids stubs (good) but inverts the canonical-vs-distilled relationship: thematic skill folders have only SKILL.md (no check.py), with the real verification code living inside `workflows/<skill>/check.py`. The absence of stubs is good; the inversion is not — editing a rule then requires touching both SKILL.md (the doc) and the workflow check.py (the code). Single source of truth is lost.
+
+ The canonical landing looks like this: every skill has BOTH a substantive SKILL.md AND a real `check.py` (regex + applicability logic), and `workflows/<id>/workflow_v1.py` is a thin wrapper (~50 LOC) that imports and calls it:
 
  ```python
  # workflows/D01-01/workflow_v1.py — thin wrapper, 52 LOC
@@ -185,11 +175,7 @@ def run(doc_text: str, meta: dict = None) -> dict:
  return result
  ```
 
- This is the v0.7.2+ canonical pattern: workflow is a shim that
- points at the skill's check.py. To iterate on a rule's verification,
- edit `rule_skills/<id>/check.py`. The workflow doesn't change. v0.7.2
- clarifies the teaching: avoid stubs AND keep the canonical
- relationship (skill is canonical, workflow is distilled wrapper).
+ This is the canonical pattern: workflow is a shim that points at the skill's check.py. To iterate on a rule's verification, edit `rule_skills/<id>/check.py`. The workflow doesn't change. The teaching has two parts: avoid stubs AND keep the canonical relationship (skill is canonical, workflow is distilled wrapper).
 
  ### Naming convention for grouped checks
 
@@ -355,7 +341,7 @@ When entering skill_authoring with an empty TaskBoard:
 
  ### Calling TaskCreate / TaskUpdate / TaskComplete
 
- The engine registers three task-board tools (v0.7.4):
+ The engine registers three task-board tools:
 
  - `TaskCreate({id, title, phase, ruleId?})` — adds a task to `tasks.json`. `id` must be unique within the session; pick a stable shape like `<rule_id>-<phase>` for per-rule tasks or `<group-name>-<phase>` for grouped / non-rule tasks. `phase` is the current phase the task belongs to. `ruleId` is optional — set it for per-rule tasks so the engine can credit the rule_id in milestone derivation.
  - `TaskUpdate({id, status?, summary?})` — change a task's status (`pending` / `in_progress` / `completed` / `failed`), optionally with a short summary.
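The tool shapes above can be illustrated with a hypothetical per-rule lifecycle. The payload keys come from the signatures just described; the ids, titles, and summaries here are invented for illustration:

```python
# Hypothetical lifecycle for one per-rule task, shown as plain dict
# payloads matching the TaskCreate / TaskUpdate signatures above.
create_payload = {
    "id": "R028-skill_authoring",   # stable <rule_id>-<phase> shape
    "title": "Author check.py for R028",
    "phase": "skill_authoring",
    "ruleId": "R028",  # lets milestone derivation credit the rule
}
start_payload = {"id": "R028-skill_authoring", "status": "in_progress"}
done_payload = {
    "id": "R028-skill_authoring",
    "status": "completed",
    "summary": "check.py plus workflow wrapper landed",
}
```

The same `id` threads through all three calls, which is what makes the `<rule_id>-<phase>` naming shape worth keeping stable.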
@@ -363,7 +349,7 @@ The engine registers three task-board tools (v0.7.4):
 
  ### Ralph loop scope — within a phase only
 
- Important contract (changed in v0.7.4 after team feedback):
+ Important contract:
 
  - **Loop scope = current phase only.** TaskCreate populates tasks for the CURRENT phase. The Ralph loop processes them one by one within the phase.
  - **Loop exits at phase boundaries.** When all current-phase tasks complete OR the phase advances (you call `phase_advance`, or anything else changes `currentPhase`), the loop exits cleanly. Control returns to the user.
@@ -395,9 +381,9 @@ Three formats, each defensible. Pick one and stick with it:
 
  - **`rules/PATTERNS.md`** — concise, framework-only, updated as the project progresses. Best for greenfield projects with clear hypothesis-up-front structure. Capped at ~5 KB; entries are transferable shapes / project constraints / anti-patterns with rationale (see "What to write" above).
 
- - **`logs/phase_<name>_complete.md` per phase** — incremental, captures what each phase produced + decisions made + what the next phase inherits. Best for iterative discovery work where the framework crystallizes mid-run. E2E #7 GLM used this pattern across 6 phase docs and an `evolution_summary_v1.2.md`; the methodology was captured even though PATTERNS.md was never written.
+ - **`logs/phase_<name>_complete.md` per phase** — incremental, captures what each phase produced + decisions made + what the next phase inherits. Best for iterative discovery work where the framework crystallizes mid-run. A real example pattern: six phase docs plus an `evolution_summary_vN.md` capture the methodology even when PATTERNS.md is never written.
 
- - **`AGENT.md` decisions section + domain notes** — narrative-style, living document of "what we know" and "why". Best for projects with rich domain context to capture (regulations, edge cases, thresholds, sample format distributions). E2E #7 GLM's AGENT.md included regulation enforcement dates, product type taxonomies, threshold values, and sample format counts; this is fine, it's a different idiom for the same goal.
+ - **`AGENT.md` decisions section + domain notes** — narrative-style, living document of "what we know" and "why". Best for projects with rich domain context to capture (regulations, edge cases, thresholds, sample format distributions). An AGENT.md that records regulation enforcement dates, product type taxonomies, threshold values, and sample format counts works perfectly; it's a different idiom for the same goal.
 
  What you should NOT do: skip persistence and rely only on the live conversation context. By the time you have N skills authored without any persisted methodology, you've made N implicit decisions about verdict shape, chunker boundaries, and worker tier. Each rule re-derives from scratch. Refactoring requires touching N files instead of one.
 
@@ -405,15 +391,15 @@ What you should NOT do: skip persistence and rely only on the live conversation
 
  ✅ "Before each phase advance, write what I learned to whichever persistence file matches this project's idiom — even if it's tentative."
 
- E2E history:
- - E2E #6 v070 DS wrote PATTERNS.md only after a rollback. Per-skill decisions before that point had to be re-touched. v0.7.1 added "PATTERNS.md FIRST" reinforcement.
- - E2E #7 v071 neither DS nor GLM wrote PATTERNS.md, but GLM wrote 6 rich phase-completion logs and a comprehensive AGENT.md; the methodology WAS captured, just in different files. v0.7.2 blesses the broader principle: persist before you advance, format flexible.
+ Failure modes worth flagging:
+ - An agent writes PATTERNS.md only after a rollback. All the per-skill decisions made before that point have to be re-touched. "PATTERNS.md FIRST" exists because of this cost.
+ - An agent skips PATTERNS.md entirely but writes rich phase-completion logs and a comprehensive AGENT.md. The methodology IS captured, just in different files — which is fine. The broader principle: persist before you advance, format flexible.
 
- The engine's filesystem-derived milestones (Group A v0.7.0) verify coverage on disk regardless of how you split the work. The TaskBoard is your scratchpad; the disk is the contract; the persistence file is your project's memory.
+ The engine's filesystem-derived milestones verify coverage on disk regardless of how you split the work. The TaskBoard is your scratchpad; the disk is the contract; the persistence file is your project's memory.
 
  ## Subagent batch work: rolling-window writes
 
- When you dispatch N subagents to do batch work (regression tests, batch verification, parallel rule processing), DO NOT have them write to a shared coordination file. v0.7.5 audits found subagents racing on `tasks.json` / `rules/catalog.json` / `output/results/summary.json` — one took the workspace lock for 5+ minutes while others waited silently.
+ When you dispatch N subagents to do batch work (regression tests, batch verification, parallel rule processing), DO NOT have them write to a shared coordination file. A failure mode worth flagging: subagents race on `tasks.json` / `rules/catalog.json` / `output/results/summary.json` — one takes the workspace lock for several minutes while the others wait silently.
 
  The right pattern: each subagent writes to its OWN file under a known prefix. The parent aggregates after all subagents finish.
 
@@ -433,6 +419,6 @@ output/
  # Parent agent reads all batch_regression_*.json and writes the aggregate.
  ```
 
- Engine signal: if you see `lock_blocked` events in events.jsonl during subagent work, that's the symptom. v0.8 P4-C added the event emission so the parent has visibility into contention before the subagent times out. If the pattern shows up, refactor to rolling-window writes.
+ Engine signal: if you see `lock_blocked` events in events.jsonl during subagent work, that's the symptom; the engine emits this event so the parent has visibility into contention before the subagent times out. If the pattern shows up, refactor to rolling-window writes.
 
  Don't write a "coordinate via file locking" subagent batch. The locking primitives exist for safety against accidental concurrent writes, not as a queue. Use the filesystem layout as the coordination mechanism.
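The parent-side aggregation step can be sketched in a few lines. This is an illustrative shape, not the package's code; it assumes each subagent has already written its own `batch_regression_<i>.json` under the results directory, per the layout above:

```python
import json
from pathlib import Path

def aggregate_batches(results_dir, prefix="batch_regression_"):
    """Parent-side aggregation for rolling-window writes: read every
    per-subagent file and merge into one summary. The subagents never
    write a shared file, so there is nothing to race on."""
    results_dir = Path(results_dir)
    merged = {"batches": 0, "results": []}
    for path in sorted(results_dir.glob(f"{prefix}*.json")):
        batch = json.loads(path.read_text())
        merged["batches"] += 1
        merged["results"].extend(batch["results"])
    # Single writer: only the parent touches the aggregate file.
    (results_dir / "summary.json").write_text(json.dumps(merged, indent=2))
    return merged
```

The filesystem layout is the coordination mechanism: one file per writer, one aggregation pass after all subagents finish.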
@@ -2,51 +2,72 @@
  name: auto-model-selection
  tier: meta
  description: >
- 使用 Context7 CLI 获取最新 LLM 模型信息。当需要了解可用模型、模型能力、价格、
- 上下文窗口大小、或哪个模型适合某项任务时使用——包括分层分配、Worker LLM 工作流设计、
- 模型对比、服务商 API 调用方式等。Context7 提供训练数据中可能没有的最新信息。
- 需要安装 context7 CLI (npm i -g context7)。可选插件。
+ Context7 CLI 查最新模型事实(参数规模、API 格式、上下文窗口),用下面的指南
+ 来理解"什么类型的模型适合 doc verification 应用的哪一段"。在为某个 tier 槽位
+ 挑模型、在多个服务商之间取舍、或回看现有分层分配是否合理时调用。Context7 给你
+ 新鲜事实;本 skill 给你把那些事实落到 KC 流水线上的启发式经验。可选插件
+ (安装:npm i -g context7)。
  ---
 
- # 通过 Context7 自动选择模型
+ # Auto Model Selection
 
- ## Context7 是什么
+ 模型选择并不是经常调用的 skill —— 对大部分用户来说,工作区 `.env` 里的分层已经设定得很合理,4 层 + 成本敏感的精细选型其实是过度设计。本 skill 存在的两个时刻:conductor 在从零启动时配 tier 分配(少见);在 `skill-to-workflow` 内部某个 workflow 需要挑选合适的 worker LLM。
 
- Context7 (`c7`) 是一个轻量 CLI 工具,可获取最新的库和 API 文档。安装:`npm i -g context7`。两个命令:
- - `c7 library <查询>` — 按名称搜索库/服务商
- - `c7 docs <libraryId> <查询>` — 获取具体文档和代码示例
+ 下面的内容是经验性的 —— 是作者在这个领域里摸出来什么有效。当作起点启发式即可,保质期大概 3–6 个月,毕竟模型家族迭代很快。
 
- ## 使用时机
+ ## Worker LLM 家族 —— 实战启发式
 
- - 用户的 `model-tiers.json` 过期(KC 长时间未更新)
- - 用户切换到新服务商,需要模型发现
- - 用户明确要求更新模型选择
- - 配置向导的 `/models` 端点失败,且内置模型列表过期
+ - **Qwen 家族** —— 通用 worker 工作的健壮、便宜首选。任何时候该家族的旗舰 MoE 通常都是常规抽取/分类的最佳劳力之一。小尺寸(3B–70B)数量多且稳定。默认选它没错。
+ - **DeepSeek** —— 复杂任务表现优秀。当规则涉及多步推理、嵌套判断、或者 Qwen 在长上下文里开始吃力的场景,伸手去拿它。
+ - **GLM 和 Kimi** —— 在和 DeepSeek 相同的"复杂任务"档位也很强。代价:通常不出小尺寸变体(3B–70B),所以只能做 tier1/tier2,做不了 tier3/tier4。
 
- ## 工作流程
+ ## 旗舰 MoE 形状与 tier1 基准
 
- 1. 用户选择服务商并提供 API 密钥
- 2. 用 `c7 library <服务商名>` 找到对应的 library ID
- 3. 用 `c7 docs <id> "available models"` 获取当前模型列表
- 4. 从文档中识别:模型名称、能力(推理、编码、视觉)、上下文窗口大小、价格
- 5. 按能力和成本分配到分层:
- - LLM tier1:最强(复杂判断、抽取)
- - LLM tier2-3:中等(常规抽取、简单判断)
- - LLM tier4:最便宜(大量简单任务)
- - VLM tier1-3:视觉模型(文档解析/OCR)
- 6. 更新 `model-tiers.json` 或工作区 `.env`
+ 当前这一代旗舰 MoE LLM 有一个可识别的形状:总参数 200-400B,每个 token 激活 ~20B 专家。例子(几个月后就会过期):Qwen 在 200-400B-A20B 区间的旗舰 MoE、DeepSeek-V4-Flash 等。
 
- ## 分层原则
+ 这个形状是不错的 worker LLM 首选 —— 不一定是 tier1 绝对最强,但作为基准起点很合理。挑 tier1 模型时,先从这一类开始,除非有特定理由再往别处走。
 
- - 满足准确率阈值的最便宜模型
- - 正则是 tier0 — 比任何 LLM 都小
- - 不需要填满所有分层 — 服务商没有合适模型时留空即可
- - 在 AGENT.md 中记录哪些模型适合哪些任务
+ ## 30B 以下的小 LLM —— 基本免费
 
- ## 前置条件
+ 参数量降到 ~30B 以下后,大部分服务商上这些模型都极便宜。Qwen 在这个区间提供了一大堆选择,质量都不错。
+
+ 这个区间挑模型有两条规则:
+ - **避开 coder 变体**(名字里带 `coder` / `code` 的)—— 小尺寸的 coder 模型在通用 worker 任务上多半不可靠。只在任务确实是代码相关时再用。
+ - **优先选无 thinking 模式的变体**(如果有得选)。分配给小 worker 的任务都是简单固定的,多余的思考只会浪费时间和 token。
+
+ ## VLM / OCR 选型
+
+ 第一个问题:视觉任务是什么类型?
+
+ - **扫描件字符、印章、手写体** —— 用专门的 OCR 模型。当前的强选项(随时间会变):Paddle-OCR 家族、GLM-OCR、DeepSeek-OCR。旧版本仍能用。纯字符识别不需要上更大的通用 VLM。
+ - **图表、复杂表格、奇怪边框或无边框的结构** —— 试更大、更贵的通用 VLM。在这类场景里,OCR 专用模型和通用 VLM 之间的结构理解差距迅速拉大。
+
+ 不确定时先跑最便宜的 OCR,只在它漏掉了结构信息时再升级。
+
+ ## Context7 —— 按需查模型事实
+
+ 上面的启发式回答的是"挑哪种模型"。具体事实(当前模型名、确切上下文窗口、定价、API 格式)用 Context7 查:
 
  ```bash
- npm i -g context7
+ c7 library <服务商名>
+ c7 docs <libraryId> "available models"
  ```
 
- 验证:`c7 library openai` 应返回结果。
+ 两条命令:先用前者找到服务商的 library ID,再用后者拿到最新文档和代码示例。适用场景:
+
+ - 工作区 `model-tiers.json` 看起来过期了(KC 自上次模型发布后没更新)
+ - 用户切换服务商,需要发现可用模型
+ - 某个新服务商的 `/models` 端点返回空或无帮助
+ - 检查 `.env` 里的模型名是不是还在线
+
+ 安装:`npm i -g context7`。验证:`c7 library openai` 应返回结果。
+
+ ## tier1/tier2 选型最终落地
+
+ tier 分配最后决定的是成本。能满足准确率要求的最便宜模型胜出。正则是隐含的 "tier 0",规则只靠模式匹配就能完成时,应该先伸手去拿正则 —— 关于何时从正则升级到 worker LLM,见 `skill-to-workflow`。
+
+ 不需要填满所有 tier。如果服务商没有合适的小模型,tier3/tier4 留空完全可以。把生效的选择写到 `AGENT.md`,下次会话就能继承。
+
+ ## 复看节奏
+
+ 本 skill 的启发式会过期。作者打算每 3–6 个月在新模型代际落地之后回来更新一次。如果你发现这里的建议和 Context7 今天显示的事实矛盾了,信 Context7 —— 事实已经走在前面了。
@@ -148,6 +148,19 @@ versions.json # 版本清单(工作空间根目录)
 
  未来会话恢复时会先读 `AGENT.md`。它越充实,开发者用户需要重复解释的内容就越少。
 
+ ### 阶段切换的更新节奏
+
+ 一种值得警惕的反复出现的失败模式:agent 在 bootstrap 时把 AGENT.md 写得很丰富,之后就再也不碰 —— 后续若干小时的阶段工作里一次 AGENT.md 提交都没有。这就把长期记忆这个用途废了。
+
+ 要养成的节奏:**每次 phase transition 都往 `AGENT.md` 追加一行决策日志**。格式:
+
+ ```
+ [<时间戳> | rule_extraction → skill_authoring]
+ 抽出 N 条规则;coverage_audit 已完成;R03/R05/R07 标记为判断密集型。
+ ```
+
+ 每次阶段切换三行摩擦;积累下来就是给下一个审计员、下一次会话的三十行洞见。格式不必严格 —— 节奏比措辞更重要。
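The per-transition append can be wrapped in a tiny helper. A sketch only; the function name is invented, and the bracketed header follows the template above:

```python
from datetime import datetime, timezone

def log_phase_transition(agent_md, old_phase, new_phase, summary):
    """Append one decision-log entry to AGENT.md at a phase
    transition, using the [<timestamp> | old → new] header shape."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    entry = f"\n[{stamp} | {old_phase} → {new_phase}]\n{summary}\n"
    with open(agent_md, "a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry
```

Appending (rather than rewriting) keeps the bootstrap-time content intact while the decision log grows at the bottom.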
+
 
  ## 何时需要重新初始化
 
  以下情况需要重新运行本技能:
@@ -81,3 +81,17 @@ description: Determine whether extracted entities comply with verification rules
  - **多值情况**:文档在多处出现同一实体,但取值不一致。标记为 `uncertain`,并把所有找到的值连同各自的位置都报出来,便于人工复核时定位差异源头。
  - **条件规则**:"如果贷款金额超过 100 万,则必须有担保。"先核查条件再套用规则。条件不成立时规则不适用——结果记为 `pass`(如果你额外引入了 `not_applicable` 类别,也可以使用)。
  - **否定型规则**:有些规则核查的是"不存在"。"文档中不得存在向关联方提供的担保。"搜索"不存在"比搜索"存在"难,因为要先证伪所有可能的命中位置才能下结论。先把搜索做彻底,再对否定结论保持信心。
+
+ ## 跨文档的置信度阈值一致性
+
+ 对一条要跨多文档判定的规则(常见情况),**"通过"的置信度阈值必须在所有文档上保持一致**。一条在文档 A 上要求 0.85 置信度才能通过、在文档 B 上只要 0.75 的规则,其实是两条规则伪装成了一条。
+
+ 当 worker LLM 当判官时,把阈值写在 prompt 里或者写在后处理里,不要 "让 LLM 自己每次决定"。LLM 调用之间的随机性意味着每一次调用都会落到分布上的某个点;你的工作是把分布必须越过的那条线划出来。
+
+ 两种落实方式:
+ - **写在 prompt 里**:明确写"只有置信度高于阈值才输出 PASS"。便宜,但易受 LLM 多次调用之间漂移影响。
+ - **写在后处理里**:让 LLM 分别输出 verdict 和 confidence,再用一小段 Python wrapper 应用阈值。更可靠,引擎看到的是代码层面的阈值,不是 prompt 文本。
+
+ 对有稳定模式的规则(格式、存在性核查),优先用后处理。对主观判断(充分性、完整性),prompt 层面的阈值更好写,但值得审计 —— 抽样一个批次看 LLM 有没有遵守那条线。
+
+ `confidence-system` skill 描述了置信度怎么从多个信号合成;本节讨论的是怎么把它在同一规则的不同文档之间一致地应用。
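The post-processing option above can be sketched as a small wrapper. A hypothetical shape, assuming the worker LLM returns a verdict string and a confidence float; the threshold value is a per-rule constant:

```python
def apply_threshold(verdict, confidence, threshold=0.85):
    """Apply one rule-level confidence threshold uniformly across
    documents. The LLM reports verdict + confidence; the line the
    distribution must cross lives here in code, not in the prompt.

    `threshold` is the same value for every document the rule is
    judged against -- that is the whole point."""
    if verdict == "PASS" and confidence < threshold:
        # Not confident enough to pass: demote instead of trusting a
        # borderline call that would drift between invocations.
        return {"result": "UNCERTAIN", "confidence": confidence,
                "note": f"PASS below threshold {threshold}"}
    return {"result": verdict, "confidence": confidence}
```

Because the threshold is applied in code, the engine sees the same cut line on every document, regardless of how the LLM phrases its self-assessment.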
@@ -1,48 +1,48 @@
1
- # Lightweight Output Format Specification
1
+ # 轻量输出格式规范
2
2
 
3
- This document defines the compact text markup format for verification results, its grammar, JSON conversion rules, and edge case handling.
3
+ 本文档定义核查结果的紧凑文本标注格式、其语法、JSON 转换规则以及边界情况处理。
4
4
 
5
- ## Grammar
5
+ ## 语法
6
6
 
7
7
  ```
8
8
  [RESULT] field_name <- value (constraint) | conf:score | src:location | note:text
9
9
  ```
10
10
 
11
- | Component | Required | Format | Description | Example |
11
+ | 组成 | 是否必填 | 格式 | 说明 | 示例 |
12
12
  |-----------|----------|--------|-------------|---------|
13
- | `[RESULT]` | Yes | One of: PASS, FAIL, MISSING, ERROR, UNCERTAIN | The judgment outcome. | `[FAIL]` |
14
- | `field_name` | Yes | snake_case identifier | The rule or field being checked. | `capital_adequacy` |
15
- | `<- value` | No (omit for MISSING) | Free text, no pipes | The extracted value from the document. | `<- 12.5%` |
16
- | `(constraint)` | No (omit if no constraint) | Parenthesized expression | The expected value or condition. | `(>= 8.0%)` |
17
- | `conf:score` | Yes | Decimal 0.00-1.00 | Confidence score of the judgment. | `conf:0.95` |
18
- | `src:location` | No | Page-section reference or trace ID prefix | Source location in the document. | `src:p3-s2` |
19
- | `note:text` | No | Free text to end of line | Human-readable comment. | `note:Signing overdue by 45 days` |
13
+ | `[RESULT]` | | 取值之一:PASSFAILMISSINGERRORUNCERTAIN | 判定结果。 | `[FAIL]` |
14
+ | `field_name` | | snake_case 标识符 | 被核查的规则或字段。 | `capital_adequacy` |
15
+ | `<- value` | 否(MISSING 时省略) | 自由文本,不含竖线 | 从文档中抽取出的值。 | `<- 12.5%` |
16
+ | `(constraint)` | 否(无约束时省略) | 括号表达式 | 期望值或条件。 | `(>= 8.0%)` |
17
+ | `conf:score` | | 0.00-1.00 的小数 | 判定的置信度分数。 | `conf:0.95` |
18
+ | `src:location` | | 页-节引用或 trace ID 前缀 | 文档中的来源位置。 | `src:p3-s2` |
19
+ | `note:text` | | 至行末的自由文本 | 人类可读的注释。 | `note:Signing overdue by 45 days` |
20
20
 
21
- Components after `field_name` are separated by ` | ` (space-pipe-space). The `<- value` and `(constraint)` components appear before the first pipe, space-separated.
21
+ `field_name` 之后的各个组成部分以 ` | `(空格-竖线-空格)分隔。`<- value` `(constraint)` 出现在第一个竖线之前,彼此以空格分隔。
22
22
 
23
- ## Field Definitions
23
+ ## 字段定义
24
24
 
25
- ### Result Values
25
+ ### 结果取值
26
26
 
27
- | Value | Meaning | When to Use |
27
+ | 取值 | 含义 | 使用时机 |
28
28
  |-------|---------|-------------|
- | `PASS` | Entity complies with the rule. | Deterministic or semantic check confirms compliance. |
- | `FAIL` | Entity does not comply. | Clear non-compliance detected. Note is strongly recommended. |
- | `MISSING` | Entity not found in document. | Extraction could not locate the required field. |
- | `ERROR` | Processing failure. | Parsing error, API timeout, unexpected format. |
- | `UNCERTAIN` | Ambiguous judgment. | Borderline values, conflicting evidence, low confidence. |
+ | `PASS` | The entity complies with the rule. | A deterministic or semantic check confirms compliance. |
+ | `FAIL` | The entity does not comply. | Clear non-compliance detected. A note is strongly recommended. |
+ | `MISSING` | The entity was not found in the document. | Extraction could not locate the required field. |
+ | `ERROR` | Processing failure. | Parsing error, API timeout, unexpected format. |
+ | `UNCERTAIN` | The judgment is ambiguous. | Borderline values, conflicting evidence, low confidence. |
 
- ### Confidence Score
+ ### Confidence Score
 
- A decimal between 0.00 and 1.00 representing the system's confidence in the result. For deterministic Python checks, confidence is typically 0.95-1.00. For LLM semantic judgments, confidence reflects the model's self-assessed certainty. Scores below the configured threshold in `.env` trigger human review.
+ A decimal between 0.00 and 1.00 representing the system's confidence in the result. For deterministic Python checks, confidence is typically 0.95-1.00. For LLM semantic judgments, confidence reflects the model's self-assessed certainty. Scores below the threshold configured in `.env` trigger human review.
 
- ### Source Location
+ ### Source Location
 
- The `src:` component uses a compact reference format: `p{page}-s{section}`. Example: `src:p3-s2` means page 3, section 2. For trace ID integration, use the trace ID prefix: `src:R001-DOC042-P3-S2` (see Integration with Trace IDs below).
+ The `src:` component uses the compact reference format `p{page}-s{section}`. Example: `src:p3-s2` means page 3, section 2. For trace ID integration, use the trace ID prefix: `src:R001-DOC042-P3-S2` (see "Integration with Trace IDs" below).
 
- ## JSON Conversion
+ ## JSON Conversion
 
- ### Markup to JSON
+ ### Markup to JSON
 
 ```
 Input: [FAIL] sign_date_gap <- 75d (<= 30d) | conf:0.90 | src:p1-s4 | note:Signing overdue by 45 days
@@ -59,31 +59,31 @@ Output:
 }
 ```
 
- Pseudocode:
- 1. Parse `[RESULT]` -> lowercase -> `result` field.
- 2. Parse next token -> `field` field.
- 3. If `<-` follows, parse until `(` or `|` -> `extracted_value`.
- 4. If `(...)` follows, parse contents -> `expected`.
- 5. Split remaining by ` | `. For each segment:
- - `conf:X` -> `confidence` (parse as float).
- - `src:X` -> `source`.
- - `note:X` -> `comment`.
+ Pseudocode:
+ 1. Parse `[RESULT]` -> lowercase -> `result` field.
+ 2. Parse the next token -> `field` field.
+ 3. If `<-` follows, parse until `(` or `|` -> `extracted_value`.
+ 4. If `(...)` follows, parse its contents -> `expected`.
+ 5. Split the remainder by ` | `. For each segment:
+ - `conf:X` -> `confidence` (parsed as a float).
+ - `src:X` -> `source`.
+ - `note:X` -> `comment`.
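The pseudocode above can be sketched in a few lines of Python. This is a minimal sketch, not the package's shipped parser: the regex assumes single-space separators as in the examples, and it ignores the `\|` escape covered under Special Characters.

```python
import re

def parse_markup(line: str) -> dict:
    """Parse one markup result line into the JSON field names used above."""
    head, *segments = line.strip().split(" | ")
    m = re.match(
        r"\[(PASS|FAIL|MISSING|ERROR|UNCERTAIN)\]"   # [RESULT]
        r" (?P<field>[a-z0-9_]+)"                    # snake_case field_name
        r"(?: <- (?P<value>.*?))?"                   # optional extracted value
        r"(?: \((?P<expected>[^)]*)\))?$",           # optional (constraint)
        head,
    )
    if m is None:
        raise ValueError(f"unparseable markup line: {line!r}")
    out = {"result": m.group(1).lower(), "field": m.group("field")}
    if m.group("value"):
        out["extracted_value"] = m.group("value")
    if m.group("expected"):
        out["expected"] = m.group("expected")
    for seg in segments:                             # conf: / src: / note:
        key, _, rest = seg.partition(":")
        if key == "conf":
            out["confidence"] = float(rest)
        elif key == "src":
            out["source"] = rest
        elif key == "note":
            out["comment"] = rest
    return out
```

The lazy `.*?` on the value lets the optional `(constraint)` group claim the trailing parenthetical when one is present, while a value such as `(see detail)` with no constraint is still captured whole.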
 
- ### JSON to Markup
+ ### JSON to Markup
 
- Pseudocode:
- 1. `[` + uppercase(`result`) + `] ` + `field`.
- 2. If `extracted_value` exists: ` <- ` + `extracted_value`.
- 3. If `expected` exists: ` (` + `expected` + `)`.
- 4. ` | conf:` + format(`confidence`, 2 decimal places).
- 5. If `source` exists: ` | src:` + `source`.
- 6. If `comment` exists: ` | note:` + `comment`.
+ Pseudocode:
+ 1. `[` + uppercase(`result`) + `] ` + `field`.
+ 2. If `extracted_value` exists: ` <- ` + `extracted_value`.
+ 3. If `expected` exists: ` (` + `expected` + `)`.
+ 4. ` | conf:` + format(`confidence`, 2 decimal places).
+ 5. If `source` exists: ` | src:` + `source`.
+ 6. If `comment` exists: ` | note:` + `comment`.
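The reverse direction is symmetric. A minimal sketch under the same assumptions (a hypothetical helper, not the package's own serializer):

```python
def to_markup(obj: dict) -> str:
    """Serialize a JSON result object back into a single markup line."""
    head = f"[{obj['result'].upper()}] {obj['field']}"
    if "extracted_value" in obj:
        head += f" <- {obj['extracted_value']}"
    if "expected" in obj:
        head += f" ({obj['expected']})"
    parts = [head, f"conf:{obj['confidence']:.2f}"]  # conf is always present
    if "source" in obj:
        parts.append(f"src:{obj['source']}")
    if "comment" in obj:
        parts.append(f"note:{obj['comment']}")
    return " | ".join(parts)
```

Round-tripping a line through the parser and this serializer should reproduce it byte for byte, which makes the pair easy to property-test.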
 
- ## Diff Example
+ ## Diff Example
 
- Comparing two verification runs is where markup shines.
+ Comparing two verification runs is where the markup format shines.
 
- **Markup diff** (clean, scannable):
+ **Markup diff** (clean, scannable):
  ```
 [PASS] capital_adequacy <- 12.5% (>= 8.0%) | conf:0.95 | src:p3-s2
 - [PASS] sign_date_gap <- 28d (<= 30d) | conf:0.92 | src:p1-s4
@@ -91,7 +91,7 @@ Comparing two verification runs is where markup shines.
 [MISSING] collateral_value | conf:0.60 | note:Collateral valuation not found
 ```
 
- **JSON diff** (noisy, hard to scan):
+ **JSON diff** (noisy, hard to scan):
  ```json
 {
 "field": "sign_date_gap",
@@ -108,44 +108,44 @@ Comparing two verification runs is where markup shines.
 }
 ```
 
- The markup diff communicates the same information in one changed line vs. five changed lines.
+ The markup diff communicates the same information in one changed line instead of five.
 
- ## Edge Cases
+ ## Edge Cases
 
- ### Multi-Value Fields
- When a field has multiple extracted values (e.g., the same metric appears in two places with different values), separate values with semicolons:
+ ### Multi-Value Fields
+ When a field has multiple extracted values (e.g., the same metric appears in two places with different values), separate the values with semicolons:
  ```
 [UNCERTAIN] total_assets <- 1,234,567;1,234,590 | conf:0.50 | src:p3-s1;p7-s2 | note:Conflicting values found
 ```
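Downstream code often wants one record per value. A sketch of expanding such a line (a hypothetical helper; it assumes `src:` carries positionally matching locations when it is also semicolon-separated):

```python
def split_multi(parsed: dict) -> list:
    """Expand semicolon-separated values into one record per value."""
    values = parsed.get("extracted_value", "").split(";")
    sources = parsed.get("source", "").split(";")
    if len(sources) != len(values):        # no positional match: reuse source as-is
        sources = [parsed.get("source", "")] * len(values)
    return [
        {**parsed, "extracted_value": v, "source": s}
        for v, s in zip(values, sources)
    ]
```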
 
- ### Long Notes
- In markup, truncate notes longer than 80 characters with `...`. The full text is preserved in JSON. Example:
+ ### Long Notes
+ In markup, truncate notes longer than 80 characters and end them with `...`. The full text is preserved in JSON. Example:
  ```
 [FAIL] risk_disclosure <- (see detail) | conf:0.85 | note:Missing discussion of liquidity risk, market risk, and operational ri...
 ```
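The truncation rule is mechanical; a one-line sketch (assuming, as the example suggests, that the ellipsis counts toward the 80-character budget):

```python
def truncate_note(note: str, limit: int = 80) -> str:
    """Shorten a note for the markup view; the untruncated text stays in JSON."""
    return note if len(note) <= limit else note[: limit - 3] + "..."
```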
 
- ### Special Characters
- If a value or note contains the pipe character `|`, escape it with a backslash: `\|`. During JSON conversion, unescape back to `|`.
+ ### Special Characters
+ If a value or note contains the pipe character `|`, escape it with a backslash: `\|`. During JSON conversion, unescape it back to `|`.
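A sketch of the escape round trip (hypothetical helpers; applied to values and notes before serializing, and reversed after parsing):

```python
def escape_pipes(text: str) -> str:
    """Escape literal pipes so they survive the ' | ' component separator."""
    return text.replace("|", "\\|")

def unescape_pipes(text: str) -> str:
    """Reverse the escaping during JSON conversion."""
    return text.replace("\\|", "|")
```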
 
- ### Fields with No Constraint
- Omit the parenthetical entirely:
+ ### Fields with No Constraint
+ Omit the parenthetical entirely:
  ```
 [MISSING] collateral_value | conf:0.60 | note:Collateral valuation not found in document
 ```
 
- ### Fields with No Extracted Value
- Omit the `<-` component (common for MISSING and ERROR results):
+ ### Fields with No Extracted Value
+ Omit the `<-` component (common for MISSING and ERROR results):
  ```
 [ERROR] capital_adequacy | conf:0.00 | note:PDF parsing failed on page 3
 ```
 
- ## Integration with Trace IDs
+ ## Integration with Trace IDs
 
- The `src:` component can encode trace ID prefixes, linking each result line to the full trace ID defined by `version-control`. Use the trace ID format directly:
+ The `src:` component can encode trace ID prefixes, linking each result line to the full trace ID defined by `version-control`. Use the trace ID format directly:
 
 ```
 [PASS] capital_adequacy <- 12.5% (>= 8.0%) | conf:0.95 | src:R001-DOC042-P3-S2
 [FAIL] sign_date_gap <- 75d (<= 30d) | conf:0.90 | src:R003-DOC042-P1-S4 | note:Signing overdue by 45 days
 ```
 
- When converting to JSON, the `src:` value maps to the `trace_id` field in the full result object. The character range (`C{start}:{end}`) can be appended when full precision is needed: `src:R001-DOC042-P3-S2-C120:180`.
+ When converting to JSON, the `src:` value maps to the `trace_id` field in the full result object. When greater precision is needed, a character range (`C{start}:{end}`) can be appended: `src:R001-DOC042-P3-S2-C120:180`.
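Splitting a trace-ID-style `src:` value can be sketched as below. The component widths (R, DOC, P, S followed by digits) are assumed from the examples, not taken from a published grammar:

```python
import re

def parse_src(src: str) -> dict:
    """Split a trace-ID-style src value; the C{start}:{end} range is optional."""
    m = re.match(
        r"(?P<trace_id>R\d+-DOC\d+-P\d+-S\d+)"
        r"(?:-C(?P<start>\d+):(?P<end>\d+))?$",
        src,
    )
    if m is None:
        return {"source": src}        # compact p{page}-s{section} form
    out = {"trace_id": m.group("trace_id")}
    if m.group("start") is not None:
        out["char_range"] = (int(m.group("start")), int(m.group("end")))
    return out
```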
@@ -67,22 +67,47 @@ description: Design and calibrate confidence scoring for extraction and verifica
 - **Near match**: the document shares features with a corner case but does not fully match it → lower confidence slightly.
 - **No match**: a normal document → no adjustment.
 
+ ### Signal Five: Format Pattern Conformance
+
+ Does the extracted value fit the "regex-able" format the rule implies? Many fields have a recognizable shape (phone numbers, national ID numbers, dates, amounts, regulatory codes) that can be cheaply validated with a regular expression, independent of the LLM's own confidence in its answer.
+
+ - The value matches the expected format pattern: positive signal.
+ - The value violates the format pattern (e.g., letters in a phone-number field): strong negative signal, often an indicator of hallucination.
+ - The field has no applicable format pattern: neutral.
+
+ This catches a common failure mode: the LLM correctly locates where the value is but gets its format wrong (a digit short, a missing country code, or a placeholder like "see appendix" returned as the value).
+
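A sketch of the check, assuming per-field patterns loaded from rule configuration (the field names and patterns here are illustrative, not shipped defaults):

```python
import re

# Illustrative patterns; a real project would define these per rule.
FIELD_PATTERNS = {
    "capital_adequacy": r"\d+(?:\.\d+)?%",    # percentage, e.g. 12.5%
    "sign_date":        r"\d{4}-\d{2}-\d{2}", # ISO date
}

def format_signal(field: str, value: str) -> float:
    """+1.0 on a format match, -1.0 on a violation, 0.0 when no pattern applies."""
    pattern = FIELD_PATTERNS.get(field)
    if pattern is None:
        return 0.0                            # neutral: nothing to check
    return 1.0 if re.fullmatch(pattern, value) else -1.0
```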
+ ### Signal Six: Statistical Outliers
+
+ For numeric fields: does the value fall far outside the typical range seen for the same field across other documents?
+
+ - Within 1 standard deviation of the mean: neutral / slightly positive.
+ - 2-3 standard deviations: slightly negative.
+ - Beyond 3 standard deviations, or in a range that is impossible for the domain: strong negative signal, often a unit error (yuan vs. 10,000 yuan), a misplaced decimal point, or a hallucinated extra or dropped digit.
+
+ This is especially useful for amounts, percentages, and ratios. Reference ranges are computed from QC-confirmed historical data and refreshed periodically. For categorical fields, the analogue is "value not in the previously observed set": route any unseen value to review.
+
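The banding above maps to a small z-score check; a sketch with illustrative signal values (the thresholds and magnitudes are placeholders to be tuned per rule):

```python
from statistics import mean, stdev

def outlier_signal(value: float, history: list) -> float:
    """Score a numeric value against QC-confirmed historical values."""
    if len(history) < 2:
        return 0.0                            # not enough data to judge
    sd = stdev(history)
    if sd == 0:
        return 0.0 if value == history[0] else -1.0
    z = abs(value - mean(history)) / sd
    if z <= 1:
        return 0.1                            # within 1 sd: slightly positive
    if z <= 3:
        return -0.3                           # up to 3 sd: mildly negative
    return -1.0                               # beyond 3 sd: likely unit error
```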
 ## Combining Signals
 
- Combine the signals with a weighted average:
+ Combine the signals above into a single confidence score. The usual form is a weighted sum:

 ```
- confidence = w1 × method_prior + w2 × source_match + w3 × historical_accuracy + w4 × corner_case_adj
+ confidence = w_method  × method_prior
+            + w_source  × source_match
+            + w_history × historical_accuracy
+            + w_corner  × corner_case_adj
+            + w_format  × format_conformance
+            + w_outlier × outlier_check
  ```
 
- ### Initial Weight Recommendations
+ Setting the weights **is a judgment you must make per rule and per project**. The points below are directional principles, not prescribed values:
+
+ - Historical accuracy is the most predictive signal once data exists; increase its weight after a few QC rounds.
+ - The method prior and source-text match are always available; they carry the score early on, before historical data has accumulated.
+ - Format conformance and the statistical outlier check do not depend on the LLM; they keep working when the LLM is overconfident, which is exactly when you need them.
+ - The corner-case adjustment usually carries a small weight, but it should kick in decisively whenever a known corner case matches.
 
- | Signal | Weight | Notes |
- |-----|------|------|
- | w1 (method prior) | 0.25 | Baseline signal, always available |
- | w2 (source-text match) | 0.25 | Anti-hallucination signal, always available |
- | w3 (historical accuracy) | 0.35 | The most important signal, but requires accumulated data |
- | w4 (corner-case distance) | 0.15 | Auxiliary signal |
+ While historical data is still accumulating, redistribute w_history's weight to the other signals. The weights that best fit this rule and this corpus will emerge gradually from the calibration loop, the process described in the next section.
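Redistribution can be done by renormalizing over whichever signals are present; a sketch (the weight values in the test are illustrative, per the principles above):

```python
def combine(signals: dict, weights: dict) -> float:
    """Weighted sum over available signals. Weights of missing signals
    (e.g. historical_accuracy before any QC rounds) are redistributed
    proportionally, and the result is clamped to [0, 1]."""
    present = {k: w for k, w in weights.items() if signals.get(k) is not None}
    total = sum(present.values())
    if total == 0:
        return 0.0                 # no usable signals at all
    score = sum(signals[k] * w / total for k, w in present.items())
    return max(0.0, min(1.0, score))
```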
 
 ### When Historical Data Is Unavailable