kc-beta 0.7.3 → 0.7.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (88)
  1. package/README.md +10 -4
  2. package/bin/kc-beta.js +20 -6
  3. package/package.json +1 -1
  4. package/src/agent/engine.js +131 -60
  5. package/src/agent/pipelines/_milestone-derive.js +140 -4
  6. package/src/agent/pipelines/initializer.js +4 -1
  7. package/src/agent/skill-loader.js +433 -111
  8. package/src/agent/tools/consult-skill.js +112 -0
  9. package/src/agent/tools/copy-to-workspace.js +4 -3
  10. package/src/agent/tools/release.js +128 -1
  11. package/src/agent/tools/workspace-file.js +7 -7
  12. package/src/config.js +1 -1
  13. package/template/AGENT.md +182 -7
  14. package/template/skills/en/{meta-meta/auto-model-selection → auto-model-selection}/SKILL.md +1 -0
  15. package/template/skills/en/{meta-meta/bootstrap-workspace → bootstrap-workspace}/SKILL.md +1 -0
  16. package/template/skills/{zh/meta → en}/compliance-judgment/SKILL.md +1 -0
  17. package/template/skills/en/{meta/confidence-system → confidence-system}/SKILL.md +1 -0
  18. package/template/skills/en/{meta/corner-case-management → corner-case-management}/SKILL.md +1 -0
  19. package/template/skills/en/{meta/cross-document-verification → cross-document-verification}/SKILL.md +1 -0
  20. package/template/skills/en/{meta-meta/dashboard-reporting → dashboard-reporting}/SKILL.md +1 -0
  21. package/template/skills/en/{meta/data-sensibility → data-sensibility}/SKILL.md +1 -0
  22. package/template/skills/{zh/meta → en}/document-chunking/SKILL.md +1 -0
  23. package/template/skills/en/{meta/document-parsing → document-parsing}/SKILL.md +1 -0
  24. package/template/skills/{zh/meta → en}/entity-extraction/SKILL.md +1 -0
  25. package/template/skills/en/{meta-meta/evolution-loop → evolution-loop}/SKILL.md +1 -0
  26. package/template/skills/en/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/SKILL.md +1 -0
  27. package/template/skills/en/{meta-meta/quality-control → quality-control}/SKILL.md +1 -0
  28. package/template/skills/en/{meta-meta/rule-extraction → rule-extraction}/SKILL.md +1 -0
  29. package/template/skills/en/{meta-meta/rule-graph → rule-graph}/SKILL.md +1 -0
  30. package/template/skills/en/{meta-meta/skill-authoring → skill-authoring}/SKILL.md +1 -0
  31. package/template/skills/en/skill-creator/SKILL.md +2 -1
  32. package/template/skills/en/{meta-meta/skill-to-workflow → skill-to-workflow}/SKILL.md +5 -4
  33. package/template/skills/en/{meta-meta/task-decomposition → task-decomposition}/SKILL.md +1 -0
  34. package/template/skills/en/{meta/tree-processing → tree-processing}/SKILL.md +1 -0
  35. package/template/skills/en/{meta-meta/version-control → version-control}/SKILL.md +1 -0
  36. package/template/skills/en/{meta-meta/work-decomposition → work-decomposition}/SKILL.md +17 -6
  37. package/template/skills/phase_skills.yaml +107 -0
  38. package/template/skills/zh/{meta-meta/auto-model-selection → auto-model-selection}/SKILL.md +1 -0
  39. package/template/skills/zh/{meta-meta/bootstrap-workspace → bootstrap-workspace}/SKILL.md +1 -0
  40. package/template/skills/{en/meta → zh}/compliance-judgment/SKILL.md +1 -0
  41. package/template/skills/zh/{meta/confidence-system → confidence-system}/SKILL.md +1 -0
  42. package/template/skills/zh/{meta/corner-case-management → corner-case-management}/SKILL.md +1 -0
  43. package/template/skills/zh/{meta/cross-document-verification → cross-document-verification}/SKILL.md +1 -0
  44. package/template/skills/zh/{meta-meta/dashboard-reporting → dashboard-reporting}/SKILL.md +1 -0
  45. package/template/skills/zh/{meta/data-sensibility → data-sensibility}/SKILL.md +1 -0
  46. package/template/skills/{en/meta → zh}/document-chunking/SKILL.md +1 -0
  47. package/template/skills/zh/{meta/document-parsing → document-parsing}/SKILL.md +1 -0
  48. package/template/skills/{en/meta → zh}/entity-extraction/SKILL.md +1 -0
  49. package/template/skills/zh/{meta-meta/evolution-loop → evolution-loop}/SKILL.md +1 -0
  50. package/template/skills/zh/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/SKILL.md +1 -0
  51. package/template/skills/zh/{meta-meta/quality-control → quality-control}/SKILL.md +1 -0
  52. package/template/skills/zh/{meta-meta/rule-extraction → rule-extraction}/SKILL.md +1 -0
  53. package/template/skills/zh/{meta-meta/rule-graph → rule-graph}/SKILL.md +1 -0
  54. package/template/skills/zh/{meta-meta/skill-authoring → skill-authoring}/SKILL.md +1 -0
  55. package/template/skills/zh/skill-creator/SKILL.md +2 -1
  56. package/template/skills/zh/skill-to-workflow/SKILL.md +190 -0
  57. package/template/skills/zh/{meta-meta/task-decomposition → task-decomposition}/SKILL.md +1 -0
  58. package/template/skills/zh/{meta/tree-processing → tree-processing}/SKILL.md +1 -0
  59. package/template/skills/zh/{meta-meta/version-control → version-control}/SKILL.md +1 -0
  60. package/template/skills/zh/{meta-meta/work-decomposition → work-decomposition}/SKILL.md +15 -4
  61. package/template/CLAUDE.md +0 -150
  62. package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +0 -188
  63. /package/template/skills/en/{meta/compliance-judgment → compliance-judgment}/references/output-format.md +0 -0
  64. /package/template/skills/en/{meta/cross-document-verification → cross-document-verification}/references/contradiction-taxonomy.md +0 -0
  65. /package/template/skills/en/{meta-meta/dashboard-reporting → dashboard-reporting}/scripts/generate_dashboard.py +0 -0
  66. /package/template/skills/en/{meta/document-parsing → document-parsing}/references/parser-catalog.md +0 -0
  67. /package/template/skills/en/{meta-meta/evolution-loop → evolution-loop}/references/convergence-guide.md +0 -0
  68. /package/template/skills/en/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/scripts/generate_review.js +0 -0
  69. /package/template/skills/en/{meta-meta/quality-control → quality-control}/references/qa-layers.md +0 -0
  70. /package/template/skills/en/{meta-meta/quality-control → quality-control}/references/sampling-strategies.md +0 -0
  71. /package/template/skills/en/{meta-meta/rule-extraction → rule-extraction}/references/chunking-strategies.md +0 -0
  72. /package/template/skills/en/{meta-meta/skill-authoring → skill-authoring}/references/skill-format-spec.md +0 -0
  73. /package/template/skills/en/{meta-meta/skill-to-workflow → skill-to-workflow}/references/worker-llm-catalog.md +0 -0
  74. /package/template/skills/en/{meta-meta/task-decomposition → task-decomposition}/references/decision-matrix.md +0 -0
  75. /package/template/skills/en/{meta-meta/version-control → version-control}/references/trace-id-spec.md +0 -0
  76. /package/template/skills/zh/{meta/compliance-judgment → compliance-judgment}/references/output-format.md +0 -0
  77. /package/template/skills/zh/{meta/cross-document-verification → cross-document-verification}/references/contradiction-taxonomy.md +0 -0
  78. /package/template/skills/zh/{meta-meta/dashboard-reporting → dashboard-reporting}/scripts/generate_dashboard.py +0 -0
  79. /package/template/skills/zh/{meta/document-parsing → document-parsing}/references/parser-catalog.md +0 -0
  80. /package/template/skills/zh/{meta-meta/evolution-loop → evolution-loop}/references/convergence-guide.md +0 -0
  81. /package/template/skills/zh/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/scripts/generate_review.js +0 -0
  82. /package/template/skills/zh/{meta-meta/quality-control → quality-control}/references/qa-layers.md +0 -0
  83. /package/template/skills/zh/{meta-meta/quality-control → quality-control}/references/sampling-strategies.md +0 -0
  84. /package/template/skills/zh/{meta-meta/rule-extraction → rule-extraction}/references/chunking-strategies.md +0 -0
  85. /package/template/skills/zh/{meta-meta/skill-authoring → skill-authoring}/references/skill-format-spec.md +0 -0
  86. /package/template/skills/zh/{meta-meta/skill-to-workflow → skill-to-workflow}/references/worker-llm-catalog.md +0 -0
  87. /package/template/skills/zh/{meta-meta/task-decomposition → task-decomposition}/references/decision-matrix.md +0 -0
  88. /package/template/skills/zh/{meta-meta/version-control → version-control}/references/trace-id-spec.md +0 -0
package/template/skills/zh/{meta-meta/work-decomposition → work-decomposition}/SKILL.md

@@ -1,5 +1,6 @@
 ---
 name: work-decomposition
+tier: meta-meta
 description: Decides how to split the rule set into TaskBoard tasks at the rule_extraction → skill_authoring transition. Covers ordering methods (difficulty-first / Shannon–Huffman, breadth-first, depth-first, binary splits), grouping strategy (the criteria for merging several rules into one task vs. keeping them separate), three-axis difficulty assessment, and how to write a PATTERNS.md project memory that stays useful across the whole pipeline. Use when entering rule_extraction, entering skill_authoring, or when the TaskBoard feels off track and you want to re-split.
 ---
 
@@ -7,7 +8,7 @@ description: Decides how to split the rule set into TaskBoard tasks at the rule
 
 KC's main agent is the conductor. The conductor decides what to do next, and that decision outranks every choice that follows. A bad split makes the whole session expensive: order the rules wrongly and the agent redesigns the same structure three times; merge unrelated rules into one skill and check.py ends up as the "unified executor" anti-pattern from E2E #4; scatter related rules that belonged together across skills and the agent re-derives the same chunker logic 17 times.
 
-This skill is the conductor's playbook for those decisions. It lives under `meta-meta/` because work decomposition is a system-level discipline, not a technique specific to any one rule. The complementary `task-decomposition` (also under `meta-meta/`) covers the structure **inside** a single rule: locate → extract → normalize → judge → comment. This skill covers how a **set** of rules gets cut into TaskBoard tasks.
+This skill is the conductor's playbook for those decisions. It is tagged `tier: meta-meta` because work decomposition is a system-level discipline, not a technique specific to any one rule. The complementary `task-decomposition` (also `tier: meta-meta`) covers the structure **inside** a single rule: locate → extract → normalize → judge → comment. This skill covers how a **set** of rules gets cut into TaskBoard tasks.
 
 ## When to use this skill
 
@@ -339,13 +340,23 @@ Keep the whole of PATTERNS.md to about 5 KB. When it grows past that, cut the l
 
 ### Calling TaskCreate / TaskUpdate / TaskComplete
 
-The engine registers three task-board tools (v0.7.3+):
+The engine registers three task-board tools (v0.7.4):
 
-- `TaskCreate({id, title, phase, ruleId?})`: adds a task to `tasks.json`. `id` must be unique within the session; per-rule tasks should use the stable shape `<rule_id>-<phase>`, grouped / non-rule tasks `<group-name>-<phase>`. `phase` is the phase the task belongs to (the current phase, or a future phase you have sequenced in advance). `ruleId` is optional; when set, the engine counts that rule_id toward coverage during milestone derivation.
+- `TaskCreate({id, title, phase, ruleId?})`: adds a task to `tasks.json`. `id` must be unique within the session; per-rule tasks should use the stable shape `<rule_id>-<phase>`, grouped / non-rule tasks `<group-name>-<phase>`. `phase` is the current phase the task belongs to. `ruleId` is optional; when set, the engine counts that rule_id toward coverage during milestone derivation.
 - `TaskUpdate({id, status?, summary?})`: sets the task status to `pending` / `in_progress` / `completed` / `failed`, with an optional one-line summary.
 - `TaskComplete({id, summary?})`: sugar for `TaskUpdate({id, status:"completed", summary})`. The most common path after finishing a unit of work.
 
-Call `TaskCreate` to write your split onto the board; once this turn ends, the Ralph loop picks up the next pending task and runs it. Do the work, call `TaskComplete`, and the loop advances. If a task cannot be completed (an unrecoverable error), call `TaskUpdate({id, status:"failed", summary:"reason"})` so the queue keeps moving instead of jamming.
+### Ralph loop scope: current phase only
+
+An important contract (adjusted in v0.7.4 after team feedback):
+
+- **Loop scope = current phase only.** TaskCreate may only create tasks for the current phase, and the Ralph loop works through them one by one within that phase.
+- **Phase boundary = loop exit.** When every task in the current phase is completed, or the phase advances (you call `phase_advance`, or anything else changes `currentPhase`), the loop exits cleanly and control returns to the user.
+- **The engine no longer auto-advances phases.** Even with all tasks completed and the exit conditions met, the engine will not jump to the next phase. Advancing is something you do explicitly by calling `phase_advance`, or the user does by prompting again.
+- **Do not pre-create tasks for future phases.** They will be ignored: the loop exits at the phase boundary and never gets to them. Create tasks only for the phase you are **currently in**.
+- **Phase boundary = user checkpoint.** This is deliberate. The team wants to see progress at natural break points. Finish your batch of tasks and call `phase_advance`; the loop exits, you report progress to the user in your final message, and the user prompts you into the next phase.
+
+Running unattended "from bootstrap straight through finalization without stopping" is **not the engine's job**; that capability will come later as an external driver (a `/loop`-style command) that repeatedly invokes the agent across phases. Within a single invocation, finish the current phase, advance, and hand back to the user.
 
 Example:
 
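To make the contract concrete, the task-board traffic for one phase might look like the sketch below. The tool names come from this hunk; the ids, titles, and phase name are hypothetical, and the dict shape is an illustration, not the engine's wire format.

```python
# Hypothetical task-board traffic for one phase, written as plain dicts.
phase_calls = [
    # 1. Split the CURRENT phase into tasks; never pre-create future phases.
    {"tool": "TaskCreate", "args": {"id": "R001-skill_authoring",
                                    "title": "Author skill for R001",
                                    "phase": "skill_authoring",
                                    "ruleId": "R001"}},
    {"tool": "TaskCreate", "args": {"id": "R002-skill_authoring",
                                    "title": "Author skill for R002",
                                    "phase": "skill_authoring",
                                    "ruleId": "R002"}},
    # 2. The Ralph loop runs each pending task; report each outcome.
    {"tool": "TaskComplete", "args": {"id": "R001-skill_authoring",
                                      "summary": "skill drafted and tested"}},
    # A task that cannot be finished is marked failed so the queue keeps moving.
    {"tool": "TaskUpdate", "args": {"id": "R002-skill_authoring",
                                    "status": "failed",
                                    "summary": "sample document unparseable"}},
    # 3. All tasks resolved: advance explicitly; the loop then exits to the user.
    {"tool": "phase_advance", "args": {}},
]
```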
package/template/CLAUDE.md

@@ -1,150 +0,0 @@
-# KC Reborn — Document Verification Workspace
-
-## What This Workspace Is
-
-You are a coding agent tasked with building a document verification app for the developer user's specific business scenario. The meta skills in `skills/` encode the methodology of experienced verification system architects and business analysts. You bring the intelligence and judgment to apply this methodology to the specific case at hand.
-
-Your goal: build a verification system that starts with you doing the work, then gradually distills your capability into cheap, fast workflows powered by worker LLMs. You are the ground truth. The workflows you create are the deliverables.
-
-## Roles
-
-- **Developer user**: The human you serve. They are a domain expert (e.g., tech lead at a bank's loan department). They provide the rules, the documents, and the business context. Discuss decisions with them.
-- **You (the coding agent)**: You are both the Builder (creating skills and workflows) and the Observer (judging quality). You do the verification first, prove it works, then teach smaller models to replicate your results.
-- **Worker LLMs**: The performers. Models configured in `.env` (TIER1 through TIER4) that will execute the workflows you build. Your job is to find the smallest model that works for each task.
-
-## Workspace Layout
-
-```
-Rules/    — Regulation documents, compliance notes from the developer user
-Samples/  — Sample documents for testing (your training set)
-Input/    — Production document batches awaiting verification
-Output/   — Verification results
-skills/   — Meta skills encoding verification methodology
-.env      — Configuration: API keys, model tiers, thresholds, language
-```
-
-Note: KC's session workspace under `~/.kc_agent/workspaces/<sessionId>/`
-uses lowercase counterparts (`rules/`, `samples/`, `input/`, `output/`,
-`logs/`, `workflows/`, `rule_skills/`) — these are runtime-internal and
-separate from this project's user-facing folders above. The asymmetry
-is intentional: title-case for human-facing project dirs, lowercase for
-KC's working state.
-
-## Your Mission
-
-Follow this lifecycle. Each step references the skill(s) to consult:
-
-1. **Bootstrap** → Read `bootstrap-workspace`. Understand the business scenario, read Rules/, scan Samples/, configure .env with the developer user.
-2. **Extract Rules** → Read `rule-extraction`. Decompose regulation documents into atomic, testable verification rules.
-3. **Decompose Tasks** → Read `task-decomposition`. For each rule, break the verification into sub-tasks and assign the optimal method (rule, code, LLM, or manual) to each.
-4. **Map Rule Relationships** → Read `rule-graph`. Identify shared entities, dependencies, and conflicts between rules. Each rule stays independently executable.
-5. **Write Rule Skills** → Read `skill-authoring`. Write each rule into a skill folder. Before writing extraction logic for a new document type, consult `data-sensibility` to observe the data first.
-6. **Test Skills** → Apply each skill to Samples/. Use `evolution-loop` to diagnose failures and iterate. Continue until accuracy meets the SKILL_ACCURACY threshold in .env.
-7. **Distill to Workflows** → Read `skill-to-workflow`. Convert proven skills into Python code + worker LLM prompts. Test workflows against your own results as ground truth. Iterate until WORKFLOW_ACCURACY is met.
-8. **Production QC** → Read `quality-control` and `confidence-system`. Run workflows on Input/. Sample and review results based on confidence scores. For multi-document cases, read `cross-document-verification`. Use `evolution-loop` when quality drops.
-9. **Stabilize** → Gradually reduce monitoring as workflows prove reliable. Only intervene when rules change or quality drops.
-10. **Report** → Read `dashboard-reporting`. Generate HTML dashboards so the developer user can see results, progress, and issues. Ensure dashboards include feedback collection mechanisms for users.
-
-Throughout: use `version-control` to track all changes. Use `corner-case-management` to handle edge cases without polluting workflows. Use `task-decomposition` and `rule-graph` to inform optimization decisions.
-
-## Core Principles
-
-- **Minimum viable model**: Always use the smallest, cheapest, fastest model that meets the accuracy threshold. Start simple, escalate only when necessary.
-- **JIT structure**: Do not design schemas or formats prematurely. Define them when needed, keep them consistent once defined.
-- **OTF evolution**: The system you build today may look completely different tomorrow. Embrace change.
-- **Skills before workflows**: Prove each rule works as a skill (you executing it) before distilling into code + worker LLM prompts.
-- **Log everything**: Every test iteration, every evolution decision, every version change. Both JSON (machine-readable) and plain text (human-readable).
-
-## How to Read Skills
-
-Skills use progressive disclosure:
-1. **Frontmatter** (name + description) — always visible, ~100 words. Tells you WHEN to use the skill.
-2. **SKILL.md body** — read when the skill is relevant. Under 500 lines. Conveys methodology, not recipes.
-3. **references/** — read on demand for detailed technical reference.
-4. **scripts/** — executable code you can run or adapt.
-5. **assets/** — data files, templates, examples.
-
-Skills convey philosophy and decision frameworks. Adapt them to the specific business case. Do not follow them rigidly.
-
-## Communication with the Developer User
-
-- **Proactively discuss**: rule granularity, accuracy thresholds, model selection, edge cases.
-- **Report progress**: after each testing round, share results and next steps.
-- **Escalate**: when you cannot resolve an issue after iterating, surface it with evidence.
-- **Ask**: the developer user is a domain expert. When in doubt about a rule's intent, ask.
-
----
-
-# KC Reborn — Document Verification Workspace
-
-## What This Is
-
-You are a coding agent responsible for building a document verification app for the developer user's specific business scenario. The meta skills in `skills/` encode the methodology of experienced verification system architects and business analysts. You supply the intelligence and judgment to apply that methodology to the concrete case.
-
-Your goal: build a verification system in which you perform the verification yourself first, then gradually distill your capability into cheap, fast workflows driven by worker LLMs. You are the ground truth. The workflows you create are the final deliverables.
-
-## Roles
-
-- **Developer user**: The person you serve. A domain expert (e.g., the tech lead of a bank's loan department). They provide the rules, documents, and business context. Discuss decisions with them.
-- **You (the coding agent)**: Both the Builder (creating skills and workflows) and the Observer (judging quality). You verify first, prove the approach works, then teach smaller models to reproduce your results.
-- **Worker LLMs**: The performers. Models configured in `.env` (TIER1 through TIER4) that will run the workflows you build. Your job is to find the smallest model that can handle each piece of work.
-
-## Workspace Layout
-
-```
-Rules/    — Regulation documents and the developer user's compliance notes
-Samples/  — Sample documents for testing (your training set)
-Input/    — Production document batches awaiting verification
-Output/   — Verification results
-skills/   — Meta skills encoding the verification methodology
-.env      — Configuration: API keys, model tiers, thresholds, language
-```
-
-Note: KC's session workspace under `~/.kc_agent/workspaces/<sessionId>/` uses
-lowercase counterparts (`rules/`, `samples/`, `input/`, `output/`, `logs/`,
-`workflows/`, `rule_skills/`) — these are runtime-internal directories,
-separate from the user-facing project directories above. The case asymmetry
-is deliberate: title case for the human-facing project dirs, lowercase for
-KC's own working state.
-
-## Your Mission
-
-Follow this lifecycle. Each step notes the skill(s) to consult:
-
-1. **Bootstrap** → Read `bootstrap-workspace`. Understand the business scenario, read Rules/, scan Samples/, and configure .env with the developer user.
-2. **Extract Rules** → Read `rule-extraction`. Decompose regulation documents into atomic, testable verification rules.
-3. **Decompose Tasks** → Read `task-decomposition`. For each rule, break the verification into sub-tasks and assign each the optimal method (rule, code, LLM, or manual).
-4. **Build the Rule Graph** → Read `rule-graph`. Identify shared entities, dependencies, and potential conflicts between rules. Each rule stays independently executable.
-5. **Write Rule Skills** → Read `skill-authoring`. Write each rule into a skill folder. Before writing extraction logic for a new document type, read `data-sensibility` and observe the data first.
-6. **Test Skills** → Apply each skill to Samples/. Use `evolution-loop` to diagnose failures and iterate until accuracy reaches the SKILL_ACCURACY threshold in .env.
-7. **Distill to Workflows** → Read `skill-to-workflow`. Turn validated skills into Python code plus worker LLM prompts. Test the workflows against your own results as the baseline. Iterate until WORKFLOW_ACCURACY is reached.
-8. **Production QC** → Read `quality-control` and `confidence-system`. Run the workflows on Input/. Sample and review results by confidence score. For multi-document cases, read `cross-document-verification`. Use `evolution-loop` when quality drops.
-9. **Stabilize** → As the workflows prove stable, gradually reduce monitoring. Step in only when rules change or quality drops.
-10. **Report** → Read `dashboard-reporting`. Generate HTML dashboards so the developer user can see results, progress, and issues at a glance. Make sure the dashboards build in a mechanism for collecting user feedback.
-
-Use `version-control` throughout to track every change. Use `corner-case-management` to handle edge cases without polluting the main workflows. Use `task-decomposition` and `rule-graph` to guide optimization decisions.
-
-## Core Principles
-
-- **Minimum viable model**: Always use the smallest, cheapest, fastest model that meets the accuracy threshold. Start simple; upgrade only when necessary.
-- **JIT structure**: Do not design data structures or formats prematurely. Define them when needed; keep them consistent once defined.
-- **OTF evolution**: The system you build today may be unrecognizable tomorrow. Embrace change.
-- **Skills before workflows**: First prove each rule works as a skill (with you executing it), then distill it into code plus worker LLM prompts.
-- **Log everything**: Every test iteration, every evolution decision, every version change. Keep both JSON (machine-readable) and plain text (human-readable).
-
-## How to Read Skills
-
-Skills use progressive disclosure:
-1. **Frontmatter** (name + description): always visible, about 100 words. Tells you when to use the skill.
-2. **SKILL.md body**: read when the skill is relevant. Under 500 lines. Conveys methodology, not recipes.
-3. **references/**: read on demand for detailed technical reference.
-4. **scripts/**: executable code you can run or adapt.
-5. **assets/**: data files, templates, examples.
-
-Skills convey philosophy and decision frameworks. Apply them flexibly to the specific business case; do not copy them mechanically.
-
-## Communicating with the Developer User
-
-- **Proactively discuss**: rule granularity, accuracy thresholds, model selection, edge cases.
-- **Report progress**: after each round of testing, share results and next steps.
-- **Escalate**: if iteration cannot resolve an issue, raise it with the developer user, with evidence.
-- **Ask**: the developer user is a domain expert. When a rule's intent is in doubt, ask them.
package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md

@@ -1,188 +0,0 @@
----
-name: skill-to-workflow
-description: Distill a proven verification skill into a Python workflow with worker LLM prompts. Use when a rule skill has been tested and reaches the SKILL_ACCURACY threshold defined in .env. Covers the decision of what to implement as code vs LLM calls, prompt engineering for small context windows, model tier selection and progressive downgrade, and testing workflows against the coding agent's own results as ground truth. Also use when optimizing existing workflows for cost or speed.
----
-
-# Skill to Workflow
-
-The skill is the ground truth. The workflow is a cheaper, faster approximation. Your job is to make the approximation as good as the original while being as cheap as possible.
-
-## Engineering Goal
-
-Optimize the full chain: **shortest workflow** (fewest nodes) → **smallest model per node** (cheapest tier that meets accuracy) → **shortest prompt per model** (minimum tokens). This is the engineering objective — not prompt template sophistication or framework compliance.
-
-## When to Start
-
-A skill is ready for workflow distillation when:
-- It has been tested on all documents in Samples/.
-- Its accuracy meets or exceeds the SKILL_ACCURACY threshold in `.env`.
-- Edge cases are documented in the skill's `assets/corner_cases.json`.
-- You understand the rule well enough to explain exactly how you verify it.
-
-If any of these are not true, go back and iterate on the skill first.
-
-## The Distillation Decision
-
-For each step in your skill-based verification process, ask:
-
-### Can this be done with regex or Python? (Cost: zero)
-- Date extraction with known formats → regex
-- Numeric comparison against threshold → Python arithmetic
-- Chinese numeral conversion → Python lookup table
-- Format validation (ID numbers, codes) → regex
-- Table cell extraction from structured markdown → string manipulation
-
-If yes, write it as code. These are free, fast, and deterministic.
-
-### Does this require language understanding? (Cost: worker LLM call)
-- Finding the relevant section in a document → LLM
-- Extracting an entity described in natural language → LLM
-- Judging semantic adequacy ("adequate risk disclosure") → LLM
-- Resolving ambiguous references → LLM
-
-If yes, design a worker LLM prompt. Use the smallest model tier that maintains accuracy.
-
-### The hybrid approach (most common)
-Most rules are a mix: regex extracts the number, Python compares it to the threshold, LLM handles the exceptional cases. Design the workflow as a pipeline where cheap steps run first and expensive steps run only when needed.
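A sketch of that cheap-first shape for a hypothetical registered-capital rule; the regex, threshold, and `call_worker_llm` helper below are stand-ins, not package code:

```python
import re

CAPITAL_RE = re.compile(r"注册资本[::]?\s*([\d,.]+)\s*万元")  # hypothetical pattern

def verify_capital(document_text: str, threshold: float, call_worker_llm) -> dict:
    """Cheap-first pipeline: regex, then arithmetic, then LLM only on a miss."""
    m = CAPITAL_RE.search(document_text)
    if m:  # free, deterministic path
        value = float(m.group(1).replace(",", ""))
        return {"result": "pass" if value >= threshold else "fail",
                "extracted_value": value, "llm_calls": 0}
    # Expensive path: reached only when the regex finds nothing.
    answer = call_worker_llm(
        "从下文中提取注册资本(万元),只返回数字:\n" + document_text[:2000])
    try:
        value = float(answer.strip())
    except ValueError:
        return {"result": "missing", "extracted_value": None, "llm_calls": 1}
    return {"result": "pass" if value >= threshold else "fail",
            "extracted_value": value, "llm_calls": 1}
```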
-
-### When regex alone isn't enough — decision rubric
-
-Before declaring distillation complete, audit each rule's `verification_type` / `metric` / `evidence_type` (or equivalent fields in your catalog). For rules where the required verification is one of:
-
-- **Semantic** ("is this a positive guarantee or a disclaimer?")
-- **Contextual** ("interpret this in light of the document's product type")
-- **Counterfactual** ("what should this value be, given the other fields?")
-- **Cross-field arithmetic** ("does 期初 + 收益 - 分配 = 期末?", i.e., opening balance + income - distributions = closing balance)
-
-regex alone rarely suffices. Three acceptable forms:
-
-1. **Pure regex with documented limits** — write the regex check, include a comment explaining the fragility (e.g., "matches syntactic pattern only; cannot detect semantic guarantees")
-2. **Hybrid regex + LLM** — a regex baseline catches the obvious cases, `worker_llm_call` (tier1-2) handles the ambiguous ones. The hybrid workflow declares which rule_ids escalate.
-3. **Pure LLM via `worker_llm_call`** — for fully semantic rules where no regex baseline is meaningful.
-
-Don't ship pure regex for a rule whose `verification_type` is `judgment` / `semantic` without the documented-limits note. Future-you or a colleague will assume the regex is sufficient, and that bug will hide for months.
-
-### Worker LLM cost-aware tier choice
-
-If you do escalate to LLM:
-- **tier1** (most capable, ~¥0.001-0.002/doc): cross-field reasoning, ambiguity resolution, rules that benefit from chain-of-thought
-- **tier2-3**: bulk extraction with simple semantic checks
-- **tier4** (cheapest): high-volume keyword-spotting that regex can't handle. Note: tier4 models on SiliconFlow are Qwen3.5 thinking-mode — `content` can come back empty if `reasoning_content` consumes max_tokens. Test with realistic prompts before relying on it. If you see empty responses, either bump max_tokens to ≥8192, shorten your prompt, or fall back to tier1-2.
-
-Both v0.7.1 audit conductors (DS and GLM) defaulted to all-regex distillation and only added LLM escalation when the human user explicitly asked for "V2 with worker LLM". If your rule catalog has any rules where the verification is genuinely semantic, you should reach for `worker_llm_call` yourself — don't wait to be asked.
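A minimal sketch of that empty-content guard, assuming an OpenAI-compatible client pointed at `SILICONFLOW_BASE_URL`; the retry-on-TIER2 policy is one possible reading of the tier4 note above, not the package's implementation:

```python
import os
from openai import OpenAI  # SiliconFlow exposes an OpenAI-compatible endpoint

client = OpenAI(api_key=os.environ["SILICONFLOW_API_KEY"],
                base_url=os.environ["SILICONFLOW_BASE_URL"])

def worker_llm_call(prompt: str, tier: str = "TIER4", max_tokens: int = 8192) -> str:
    """Call the model configured for `tier`; if a tier4 thinking-mode model
    returns an empty `content`, retry once on a stronger tier."""
    resp = client.chat.completions.create(
        model=os.environ[tier],
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,  # generous, so reasoning doesn't starve the answer
    )
    text = resp.choices[0].message.content or ""
    if not text.strip() and tier == "TIER4":
        return worker_llm_call(prompt, tier="TIER2", max_tokens=max_tokens)
    return text
```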
-
-## Workflow Structure
-
-A workflow is a Python file (or small set of files) in `workflows/`:
-
-```
-workflows/
-  rule_001_capital_adequacy/
-    workflow_v1.py      # The main workflow script
-    prompts/
-      extract.txt       # Worker LLM prompt for extraction
-      judge.txt         # Worker LLM prompt for judgment (if needed)
-    config.json         # Model assignments, thresholds
-```
-
-The workflow file should have a clear entry point:
-
-```python
-def verify(document_text: str, config: dict) -> dict:
-    """
-    Returns:
-        {
-          "rule_id": "R001",
-          "result": "pass" | "fail" | "missing" | "error",
-          "extracted_value": ...,
-          "confidence": 0.0-1.0,
-          "comment": "..." (only when fail),
-          "model_used": "...",
-          "llm_calls": int,
-          "llm_tokens": int
-        }
-    """
-```
-
-This is a reference, not a rigid contract. Adapt the structure to the specific rule. The important thing is that every workflow produces a result that can be compared against the skill-based ground truth.
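For instance, a filled-in entry point for the hypothetical capital-adequacy rule in the tree above, following the reference shape; the regex and the `min_ratio` config key are invented for illustration:

```python
import re

def verify(document_text: str, config: dict) -> dict:
    # Hypothetical single-step workflow: regex extraction + threshold comparison.
    out = {"rule_id": "R001", "result": "error", "extracted_value": None,
           "confidence": 0.0, "model_used": "none", "llm_calls": 0, "llm_tokens": 0}
    m = re.search(r"资本充足率[::]?\s*([\d.]+)\s*%", document_text)
    if not m:
        out.update(result="missing", confidence=0.5)
        return out
    ratio = float(m.group(1))
    threshold = config.get("min_ratio", 8.0)   # would live in config.json
    out.update(extracted_value=ratio, confidence=0.95)  # deterministic path
    if ratio >= threshold:
        out["result"] = "pass"
    else:
        out.update(result="fail",
                   comment=f"capital adequacy ratio {ratio}% below required {threshold}%")
    return out
```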
-
-## Prompt Engineering for Worker LLMs
-
-Worker LLMs have smaller context windows (typically 16K-32K tokens). Design prompts that:
-
-1. **Are self-contained.** Include everything the model needs in the prompt. Do not assume the model has context from previous calls.
-2. **Specify the output format.** "Return a JSON object with fields: value, confidence, reasoning." Structured output reduces parsing errors.
-3. **Include the narrowed context.** Do not send the entire document. Use the tree-processing pipeline (full document → relevant chapter → relevant section) to narrow the context before calling the worker LLM.
-4. **Are written in the document's language.** Chinese documents get Chinese prompts. English documents get English prompts. Do not mix languages in a single prompt.
-5. **Provide examples sparingly.** One or two examples help. Ten examples waste context window and risk overfitting.
-
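One way those five constraints can combine in an `extract.txt` template; the rule, fields, and `{section_text}` placeholder are hypothetical, and the prompt is in Chinese per point 4 because the target documents would be:

```python
# prompts/extract.txt, rendered with the narrowed section, not the whole document.
EXTRACT_PROMPT = """你是一名合规核查助手。请从下面的段落中提取"注册资本"。

段落:
{section_text}

只返回一个 JSON 对象,字段为:
  value      注册资本(万元,数字)
  confidence 0 到 1 之间的数字
  reasoning  一句话说明依据

示例输出:{{"value": 5000, "confidence": 0.9, "reasoning": "第二段明确写明注册资本"}}
"""
```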
-## Model Tier Selection
-
-Start with the highest tier (TIER1) for each step. Measure accuracy. Then try lower tiers:
-
-1. Run the workflow with TIER1 on all Samples/. Record accuracy per step.
-2. For each step, try TIER2. If accuracy stays above WORKFLOW_ACCURACY, keep TIER2.
-3. Continue downgrading per step until accuracy drops below threshold.
-4. Record the optimal tier per step in `config.json`.
-
-Different steps within the same workflow can use different model tiers. Extraction might need TIER2 while judgment might work fine with TIER3.
-
-### Formal Downgrade Protocol
-
-The basic approach above works, but a more rigorous protocol prevents premature tier commitments:
-
-**Direction**: Start top-down (TIER1 → TIER4) to establish the accuracy ceiling first. You need to know the best possible accuracy before trading it for cost savings.
-
-**Minimum test runs**: Run at least a meaningful number of documents (e.g., min(10, total_samples)) at each candidate tier before making a tier decision. Small samples are unreliable — a 3-document test could be misleading.
-
-**Accuracy delta trigger**: If a lower tier's accuracy is significantly below the higher tier's (e.g., >5 percentage points), stay at the higher tier for that step. If the delta is within tolerance, use the cheaper tier.
-
-**Per-step independence**: Each workflow step is assessed separately. Record the optimal tier per step in `config.json`. Do not assume the whole workflow must use one tier.
-
-**Re-assessment trigger**: If production quality control shows a step's accuracy degrading (e.g., due to new document formats), re-run the tier assessment for that step.
-
-**Model-task recommendation list**: Maintain a per-project mapping of (task_type → recommended_tier) based on your testing experience. Over time, these lists can be collected across projects to build generalized tier recommendations.
-
-All numbers here (10 documents, 5 percentage points, etc.) are recommended starting points. The coding agent and developer user should calibrate them — or replace them entirely with a different assessment approach — based on their specific volume, accuracy requirements, and cost constraints. The pattern matters: **test at each tier → compare accuracy → commit when within tolerance → re-assess on degradation**.
-
-This follows the same tier-transition framework as parser escalation in `document-parsing`: a quality/accuracy score drives the decision to stay, escalate, or skip.
-
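The protocol reduces to a small per-step assessment loop. A sketch under the stated defaults (10-document minimum, 5-point tolerance), where `run_step` stands in for however one workflow step gets executed at a given tier:

```python
import random

def assess_tier(step, samples, ground_truth, run_step, tolerance=0.05):
    """Top-down tier assessment for one workflow step.
    run_step(step, doc, tier) -> result comparable to ground_truth[doc]."""
    batch = random.sample(samples, min(10, len(samples)))  # minimum test runs
    accuracy = {}
    for tier in ["TIER1", "TIER2", "TIER3", "TIER4"]:
        hits = sum(run_step(step, doc, tier) == ground_truth[doc] for doc in batch)
        accuracy[tier] = hits / len(batch)
    ceiling = accuracy["TIER1"]  # best possible accuracy, established first
    chosen = "TIER1"
    for tier in ["TIER2", "TIER3", "TIER4"]:
        if ceiling - accuracy[tier] <= tolerance:  # delta within tolerance: go cheaper
            chosen = tier
        else:
            break  # significant drop: keep the last acceptable tier
    return chosen, accuracy  # record `chosen` for this step in config.json
```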
-## Testing Against Ground Truth
-
-The coding agent's skill-based results are the ground truth. For each document in Samples/:
-
-1. Run the workflow.
-2. Compare the workflow's result against the skill-based result.
-3. Log discrepancies: which step failed, what was expected vs. actual.
-4. Compute accuracy: `(matching results) / (total documents)`.
-5. If accuracy < WORKFLOW_ACCURACY, diagnose and fix. Use the `evolution-loop` methodology.
-
-## Versioning
-
-Each iteration of a workflow is a new version file: `workflow_v1.py`, `workflow_v2.py`, etc. Track which version is active in `config.json`. See the `version-control` skill for the full methodology.
-
-## Releasing Workflows
-
-Once workflows hit the accuracy threshold, they can be packaged for end users via the `release` tool. Each release is a self-contained directory under `output/releases/<slug>/` with the pinned workflows, a Python runner, a confidence scorer, an HTML dashboard generator, and a `serve.sh` helper. The bundle has no kc-beta dependency — anyone with Python and a worker LLM API key can run `python run.py <doc>` and produce verification results.
-
-What to include is your call: all rules in the catalog, or a curated subset via the `include` parameter; bundle 1-3 representative samples as `fixtures/` if you want the recipient to be able to dry-run without their own data.
-
-The `release` tool snapshots the workspace first (git tag `snap/release-<slug>`), so the bundle is regenerable from git even if `output/releases/` is later cleaned. You decide when to release — there's no automation and no forced cadence. Typical triggers: workflows reach the SKILL/WORKFLOW_ACCURACY thresholds, a stakeholder needs a hand-off, or a production cron should run pinned versions instead of latest. Discuss with the developer user.
-
-## Cost Tracking
-
-Track the cost of each workflow run:
-- Number of LLM calls per document.
-- Total tokens consumed per document.
-- Model tier used per call.
-
-This data helps the developer user understand the production cost and informs further optimization.
-
-## Worker LLM API
-
-Worker LLMs are accessed via the SiliconFlow API. Connection details are in `.env`:
-- `SILICONFLOW_API_KEY` for authentication
-- `SILICONFLOW_BASE_URL` for the API endpoint
-- Model names in `TIER1` through `TIER4`
-
-See `references/worker-llm-catalog.md` for current model capabilities and context window sizes.
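A sketch of reading that configuration at workflow start-up, assuming the python-dotenv package is available; the `WORKER_CONFIG` shape is illustrative, not part of the package:

```python
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads .env from the workspace root

WORKER_CONFIG = {
    "api_key": os.environ["SILICONFLOW_API_KEY"],
    "base_url": os.environ["SILICONFLOW_BASE_URL"],
    # One model name per tier; the concrete tier-to-model mapping is the
    # developer user's call (see references/worker-llm-catalog.md).
    "tiers": {t: os.environ[t] for t in ("TIER1", "TIER2", "TIER3", "TIER4")},
}
```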