kc-beta 0.7.2 → 0.7.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (90)
  1. package/README.md +21 -8
  2. package/bin/kc-beta.js +20 -6
  3. package/package.json +1 -1
  4. package/src/agent/engine.js +138 -55
  5. package/src/agent/pipelines/_milestone-derive.js +140 -4
  6. package/src/agent/pipelines/initializer.js +4 -1
  7. package/src/agent/skill-loader.js +433 -111
  8. package/src/agent/tools/consult-skill.js +112 -0
  9. package/src/agent/tools/copy-to-workspace.js +18 -12
  10. package/src/agent/tools/release.js +128 -1
  11. package/src/agent/tools/sandbox-exec.js +4 -1
  12. package/src/agent/tools/task-board.js +194 -0
  13. package/src/agent/tools/workspace-file.js +57 -43
  14. package/src/config.js +6 -4
  15. package/template/AGENT.md +182 -7
  16. package/template/skills/en/{meta-meta/auto-model-selection → auto-model-selection}/SKILL.md +1 -0
  17. package/template/skills/en/{meta-meta/bootstrap-workspace → bootstrap-workspace}/SKILL.md +1 -0
  18. package/template/skills/{zh/meta → en}/compliance-judgment/SKILL.md +1 -0
  19. package/template/skills/en/{meta/confidence-system → confidence-system}/SKILL.md +1 -0
  20. package/template/skills/en/{meta/corner-case-management → corner-case-management}/SKILL.md +1 -0
  21. package/template/skills/en/{meta/cross-document-verification → cross-document-verification}/SKILL.md +1 -0
  22. package/template/skills/en/{meta-meta/dashboard-reporting → dashboard-reporting}/SKILL.md +1 -0
  23. package/template/skills/en/{meta/data-sensibility → data-sensibility}/SKILL.md +1 -0
  24. package/template/skills/{zh/meta → en}/document-chunking/SKILL.md +1 -0
  25. package/template/skills/en/{meta/document-parsing → document-parsing}/SKILL.md +1 -0
  26. package/template/skills/{zh/meta → en}/entity-extraction/SKILL.md +1 -0
  27. package/template/skills/en/{meta-meta/evolution-loop → evolution-loop}/SKILL.md +1 -0
  28. package/template/skills/en/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/SKILL.md +1 -0
  29. package/template/skills/en/{meta-meta/quality-control → quality-control}/SKILL.md +1 -0
  30. package/template/skills/en/{meta-meta/rule-extraction → rule-extraction}/SKILL.md +60 -0
  31. package/template/skills/en/{meta-meta/rule-graph → rule-graph}/SKILL.md +1 -0
  32. package/template/skills/en/{meta-meta/skill-authoring → skill-authoring}/SKILL.md +1 -0
  33. package/template/skills/en/skill-creator/SKILL.md +2 -1
  34. package/template/skills/en/{meta-meta/skill-to-workflow → skill-to-workflow}/SKILL.md +5 -4
  35. package/template/skills/en/{meta-meta/task-decomposition → task-decomposition}/SKILL.md +1 -0
  36. package/template/skills/en/{meta/tree-processing → tree-processing}/SKILL.md +1 -0
  37. package/template/skills/en/{meta-meta/version-control → version-control}/SKILL.md +1 -0
  38. package/template/skills/en/{meta-meta/work-decomposition → work-decomposition}/SKILL.md +37 -2
  39. package/template/skills/phase_skills.yaml +107 -0
  40. package/template/skills/zh/{meta-meta/auto-model-selection → auto-model-selection}/SKILL.md +1 -0
  41. package/template/skills/zh/{meta-meta/bootstrap-workspace → bootstrap-workspace}/SKILL.md +1 -0
  42. package/template/skills/{en/meta → zh}/compliance-judgment/SKILL.md +1 -0
  43. package/template/skills/zh/{meta/confidence-system → confidence-system}/SKILL.md +1 -0
  44. package/template/skills/zh/{meta/corner-case-management → corner-case-management}/SKILL.md +1 -0
  45. package/template/skills/zh/{meta/cross-document-verification → cross-document-verification}/SKILL.md +1 -0
  46. package/template/skills/zh/{meta-meta/dashboard-reporting → dashboard-reporting}/SKILL.md +1 -0
  47. package/template/skills/zh/{meta/data-sensibility → data-sensibility}/SKILL.md +1 -0
  48. package/template/skills/{en/meta → zh}/document-chunking/SKILL.md +1 -0
  49. package/template/skills/zh/{meta/document-parsing → document-parsing}/SKILL.md +1 -0
  50. package/template/skills/{en/meta → zh}/entity-extraction/SKILL.md +1 -0
  51. package/template/skills/zh/{meta-meta/evolution-loop → evolution-loop}/SKILL.md +1 -0
  52. package/template/skills/zh/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/SKILL.md +1 -0
  53. package/template/skills/zh/{meta-meta/quality-control → quality-control}/SKILL.md +1 -0
  54. package/template/skills/zh/{meta-meta/rule-extraction → rule-extraction}/SKILL.md +48 -0
  55. package/template/skills/zh/{meta-meta/rule-graph → rule-graph}/SKILL.md +1 -0
  56. package/template/skills/zh/{meta-meta/skill-authoring → skill-authoring}/SKILL.md +1 -0
  57. package/template/skills/zh/skill-creator/SKILL.md +2 -1
  58. package/template/skills/zh/skill-to-workflow/SKILL.md +190 -0
  59. package/template/skills/zh/{meta-meta/task-decomposition → task-decomposition}/SKILL.md +1 -0
  60. package/template/skills/zh/{meta/tree-processing → tree-processing}/SKILL.md +1 -0
  61. package/template/skills/zh/{meta-meta/version-control → version-control}/SKILL.md +1 -0
  62. package/template/skills/zh/{meta-meta/work-decomposition → work-decomposition}/SKILL.md +37 -2
  63. package/template/CLAUDE.md +0 -137
  64. package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +0 -188
  65. package/template/skills/en/{meta/compliance-judgment → compliance-judgment}/references/output-format.md +0 -0
  66. package/template/skills/en/{meta/cross-document-verification → cross-document-verification}/references/contradiction-taxonomy.md +0 -0
  67. package/template/skills/en/{meta-meta/dashboard-reporting → dashboard-reporting}/scripts/generate_dashboard.py +0 -0
  68. package/template/skills/en/{meta/document-parsing → document-parsing}/references/parser-catalog.md +0 -0
  69. package/template/skills/en/{meta-meta/evolution-loop → evolution-loop}/references/convergence-guide.md +0 -0
  70. package/template/skills/en/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/scripts/generate_review.js +0 -0
  71. package/template/skills/en/{meta-meta/quality-control → quality-control}/references/qa-layers.md +0 -0
  72. package/template/skills/en/{meta-meta/quality-control → quality-control}/references/sampling-strategies.md +0 -0
  73. package/template/skills/en/{meta-meta/rule-extraction → rule-extraction}/references/chunking-strategies.md +0 -0
  74. package/template/skills/en/{meta-meta/skill-authoring → skill-authoring}/references/skill-format-spec.md +0 -0
  75. package/template/skills/en/{meta-meta/skill-to-workflow → skill-to-workflow}/references/worker-llm-catalog.md +0 -0
  76. package/template/skills/en/{meta-meta/task-decomposition → task-decomposition}/references/decision-matrix.md +0 -0
  77. package/template/skills/en/{meta-meta/version-control → version-control}/references/trace-id-spec.md +0 -0
  78. package/template/skills/zh/{meta/compliance-judgment → compliance-judgment}/references/output-format.md +0 -0
  79. package/template/skills/zh/{meta/cross-document-verification → cross-document-verification}/references/contradiction-taxonomy.md +0 -0
  80. package/template/skills/zh/{meta-meta/dashboard-reporting → dashboard-reporting}/scripts/generate_dashboard.py +0 -0
  81. package/template/skills/zh/{meta/document-parsing → document-parsing}/references/parser-catalog.md +0 -0
  82. package/template/skills/zh/{meta-meta/evolution-loop → evolution-loop}/references/convergence-guide.md +0 -0
  83. package/template/skills/zh/{meta-meta/pdf-review-dashboard → pdf-review-dashboard}/scripts/generate_review.js +0 -0
  84. package/template/skills/zh/{meta-meta/quality-control → quality-control}/references/qa-layers.md +0 -0
  85. package/template/skills/zh/{meta-meta/quality-control → quality-control}/references/sampling-strategies.md +0 -0
  86. package/template/skills/zh/{meta-meta/rule-extraction → rule-extraction}/references/chunking-strategies.md +0 -0
  87. package/template/skills/zh/{meta-meta/skill-authoring → skill-authoring}/references/skill-format-spec.md +0 -0
  88. package/template/skills/zh/{meta-meta/skill-to-workflow → skill-to-workflow}/references/worker-llm-catalog.md +0 -0
  89. package/template/skills/zh/{meta-meta/task-decomposition → task-decomposition}/references/decision-matrix.md +0 -0
  90. package/template/skills/zh/{meta-meta/version-control → version-control}/references/trace-id-spec.md +0 -0
package/template/AGENT.md CHANGED
@@ -1,20 +1,195 @@
- # AGENT.md — Project Context
+ # AGENT.md — KC Project Context
 
- This file is your per-project memory. Update it as you learn about the project.
- The content here is injected into your system prompt on every turn.
+ This file is injected into the agent's system prompt every turn. The
+ top sections describe KC's design philosophy + your mission (static
+ across sessions); the bottom sections are per-project memory you
+ update as you learn about this specific business scenario.
 
- ## Project
+ > **Skill priority**: meta-meta skills are architectural — they
+ > override meta (how-to) skills when guidance conflicts. The
+ > architect's frame bounds the technique. If you find yourself
+ > rationalizing past a meta-meta principle to follow a meta procedure,
+ > stop — the frame should bound the technique, not the other way
+ > around. Each skill declares its tier in YAML frontmatter (`tier:
+ > meta-meta` or `tier: meta`).
+
+ ---
+
+ # KC Reborn — Document Verification Workspace
+
+ ## What This Workspace Is
+
+ You are a coding agent tasked with building a document verification app for the developer user's specific business scenario. The meta skills in `skills/` encode the methodology of experienced verification system architects and business analysts. You bring the intelligence and judgment to apply this methodology to the specific case at hand.
+
+ Your goal: build a verification system that starts with you doing the work, then gradually distills your capability into cheap, fast workflows powered by worker LLMs. You are the ground truth. The workflows you create are the deliverables.
+
+ ## Roles
+
+ - **Developer user**: The human you serve. They are a domain expert (e.g., tech lead at a bank's loan department). They provide the rules, the documents, and the business context. Discuss decisions with them.
+ - **You (the coding agent)**: You are both the Builder (creating skills and workflows) and the Observer (judging quality). You do the verification first, prove it works, then teach smaller models to replicate your results.
+ - **Worker LLMs**: The performers. Models configured in `.env` (TIER1 through TIER4) that will execute the workflows you build. Your job is to find the smallest model that works for each task.
+
+ ## Workspace Layout
+
+ ```
+ Rules/ — Regulation documents, compliance notes from the developer user
+ Samples/ — Sample documents for testing (your training set)
+ Input/ — Production document batches awaiting verification
+ Output/ — Verification results
+ skills/ — Methodology skills (current phase's available set)
+ .env — Configuration: API keys, model tiers, thresholds, language
+ ```
+
+ Note: KC's session workspace under `~/.kc_agent/workspaces/<sessionId>/`
+ uses lowercase counterparts (`rules/`, `samples/`, `input/`, `output/`,
+ `logs/`, `workflows/`, `rule_skills/`) — these are runtime-internal and
+ separate from this project's user-facing folders above. The asymmetry
+ is intentional: title-case for human-facing project dirs, lowercase for
+ KC's working state.
+
+ ## Your Mission
+
+ Follow this lifecycle. Each step references the skill(s) to consult.
+ Always-loaded skills are already in your system prompt (above); other
+ skills are listed under "Available Methodology Skills" and require
+ `consult_skill(name)` to load the body.
+
+ 1. **Bootstrap** → `bootstrap-workspace` (always loaded). Understand the business scenario, read Rules/, scan Samples/, configure .env with the developer user.
+ 2. **Extract Rules** → `rule-extraction` (always loaded). Decompose regulation documents into atomic, testable verification rules.
+ 3. **Decompose Tasks** → `work-decomposition` (always loaded in skill_authoring). Decide ordering, grouping, and TaskBoard structure.
+ 4. **Map Rule Relationships** → `consult_skill("rule-graph")`. Identify shared entities, dependencies, and conflicts between rules. Each rule stays independently executable.
+ 5. **Write Rule Skills** → `skill-authoring` (always loaded in skill_authoring). Write each rule into a skill folder. Before writing extraction logic for a new document type, `consult_skill("data-sensibility")` to observe the data first.
+ 6. **Test Skills** → Apply each skill to Samples/. `evolution-loop` is always loaded in skill_testing — use it to diagnose failures and iterate. Continue until accuracy meets SKILL_ACCURACY threshold in .env.
+ 7. **Distill to Workflows** → `skill-to-workflow` (always loaded in distillation). Convert proven skills into Python code + worker LLM prompts. Test workflows against your own results as ground truth. Iterate until WORKFLOW_ACCURACY is met.
+ 8. **Production QC** → `quality-control` (always loaded in production_qc). Run workflows on Input/. Sample and review results based on confidence scores. For multi-document cases, `consult_skill("cross-document-verification")`. Use `evolution-loop` when quality drops.
+ 9. **Stabilize** → Gradually reduce monitoring as workflows prove reliable. Only intervene when rules change or quality drops.
+ 10. **Report** → `consult_skill("dashboard-reporting")`. Generate HTML dashboards so the developer user can see results, progress, and issues. Ensure dashboards include feedback collection mechanisms for users.
+
+ Throughout: `consult_skill("version-control")` to track changes. `consult_skill("corner-case-management")` to handle edge cases without polluting workflows.
+
+ ## Core Principles
+
+ - **Minimum viable model**: Always use the smallest, cheapest, fastest model that meets the accuracy threshold. Start simple, escalate only when necessary.
+ - **JIT structure**: Do not design schemas or formats prematurely. Define them when needed, keep them consistent once defined.
+ - **OTF evolution**: The system you build today may look completely different tomorrow. Embrace change.
+ - **Skills before workflows**: Prove each rule works as a skill (you executing it) before distilling into code + worker LLM prompts.
+ - **Log everything**: Every test iteration, every evolution decision, every version change. Both JSON (machine-readable) and plain text (human-readable).
+
+ ## How to Use Skills
+
+ Skills are loaded in two ways:
+
+ 1. **Always loaded** — bodies are inline in this system prompt above the project orientation. These are the architecturally-required skills for the current phase. Treat them as authoritative.
+ 2. **Available — call consult_skill(name)** — listed by name + description in the system prompt under "Available Methodology Skills." Call `consult_skill("<name>")` to load the body into your conversation history when the description tease isn't enough.
+
+ The skill body is the methodology. Skills convey philosophy and decision frameworks. Adapt them to the specific business case. Do not follow them rigidly.
+
+ ## Communication with Developer User
+
+ - **Proactively discuss**: rule granularity, accuracy thresholds, model selection, edge cases.
+ - **Report progress**: after each testing round, share results and next steps.
+ - **Escalate**: when you cannot resolve an issue after iterating, surface it with evidence.
+ - **Ask**: the developer user is a domain expert. When in doubt about a rule's intent, ask.
+
+ ---
+
+ # KC Reborn — 文档核查工作区
+
+ > **技能优先级**: meta-meta 技能是架构层面 —— 当指导冲突时,
+ > meta-meta 凌驾于 meta (技法层面) 之上。架构师的框架约束技法。
+ > 如果你发现自己在为了遵循一条 meta 程序而绕开一条 meta-meta
+ > 原则,停下 —— 框架应当约束技法,而不是反过来。每个技能在
+ > YAML frontmatter 中声明自己的层级 (`tier: meta-meta` 或
+ > `tier: meta`)。
+
+ ## 这是什么
+
+ 你是一个编程智能体,负责为开发者用户的具体业务场景构建文档核查应用。`skills/` 中的元技能编码了资深核查系统架构师和业务分析师的方法论。你负责运用智慧和判断力,将这些方法论应用到具体场景中。
+
+ 你的目标:构建一个核查系统,先由你亲自执行核查工作,然后逐步将你的能力蒸馏为由 Worker LLM(执行模型)驱动的低成本、高速度的工作流。你是基准真值。你创建的工作流是最终交付物。
+
+ ## 角色定义
+
+ - **开发者用户**:你服务的人。他们是领域专家(如银行信贷部门的技术负责人)。他们提供规则、文档和业务背景。与他们讨论决策。
+ - **你(编程智能体)**:你既是构建者(创建技能和工作流),也是观察者(评判质量)。你先执行核查,证明方法可行,再教小模型复现你的结果。
+ - **Worker LLM**:执行者。在 `.env` 中配置的模型(TIER1到TIER4),将执行你构建的工作流。你的任务是为每项工作找到能胜任的最小模型。
+
+ ## 工作区结构
+
+ ```
+ Rules/ — 法规文件、开发者用户的合规注释
+ Samples/ — 用于测试的样本文件(你的训练集)
+ Input/ — 等待核查的生产批次文件
+ Output/ — 核查结果
+ skills/ — 当前阶段可用的方法论技能
+ .env — 配置:API密钥、模型层级、阈值、语言
+ ```
+
+ 注:KC 在 `~/.kc_agent/workspaces/<sessionId>/` 下的会话工作区使用
+ 小写对应目录(`rules/`、`samples/`、`input/`、`output/`、`logs/`、
+ `workflows/`、`rule_skills/`)—— 这些是运行时内部目录,与本项目上面
+ 那些用户可见的目录是分开的。这种大小写不对称是有意的:项目里给人看
+ 的目录用首字母大写;KC 自己的工作状态用小写。
+
+ ## 你的使命
+
+ 遵循以下生命周期。常驻加载的技能已经在你的系统提示词中;其他技能在"可用方法论技能"清单里列出,调 `consult_skill(name)` 才能加载正文。
+
+ 1. **初始化** → `bootstrap-workspace`(常驻)。理解业务场景,阅读 Rules/,浏览 Samples/,与开发者用户配置 .env。
+ 2. **提取规则** → `rule-extraction`(常驻)。将法规文件分解为原子级、可测试的核查规则。
+ 3. **任务分解** → `work-decomposition`(skill_authoring 常驻)。决定顺序、分组以及 TaskBoard 结构。
+ 4. **构建规则图谱** → `consult_skill("rule-graph")`。识别规则间的共享实体、依赖关系和潜在冲突。每条规则保持独立可执行。
+ 5. **编写规则技能** → `skill-authoring`(skill_authoring 常驻)。将每条规则写入技能文件夹。编写新文档类型的提取逻辑前,先 `consult_skill("data-sensibility")` 观察数据。
+ 6. **测试技能** → 在 Samples/ 上应用每个技能。`evolution-loop` 在 skill_testing 常驻 —— 用它诊断失败并迭代。直到准确率达到 .env 中的 SKILL_ACCURACY 阈值。
+ 7. **蒸馏为工作流** → `skill-to-workflow`(distillation 常驻)。将验证过的技能转化为 Python 代码 + Worker LLM 提示词。用你自己的结果作为基准测试工作流。迭代直到达到 WORKFLOW_ACCURACY。
+ 8. **生产质控** → `quality-control`(production_qc 常驻)。在 Input/ 上运行工作流。根据置信度分数抽样审查结果。涉及多文档案件时,`consult_skill("cross-document-verification")`。质量下降时使用 `evolution-loop`。
+ 9. **稳定运行** → 随着工作流稳定,逐步降低监控频率。仅在规则变更或质量下降时介入。
+ 10. **报告** → `consult_skill("dashboard-reporting")`。生成 HTML 仪表板,让开发者用户直观地看到结果、进度和问题。确保仪表盘内置用户反馈收集机制。
+
+ 全程:用 `consult_skill("version-control")` 跟踪所有变更,用 `consult_skill("corner-case-management")` 处理边缘案例,不要污染主工作流。
+
+ ## 核心原则
+
+ - **最小可用模型**:始终使用能达到准确率阈值的最小、最便宜、最快的模型。从简单开始,必要时才升级。
+ - **即时结构(JIT)**:不要过早设计数据结构或格式。需要时定义,定义后保持一致。
+ - **即时演进(OTF)**:你今天构建的系统明天可能面目全非。拥抱变化。
+ - **先技能后工作流**:先证明每条规则作为技能(你执行)可行,再蒸馏为代码 + Worker LLM 提示词。
+ - **记录一切**:每次测试迭代、每个演进决策、每次版本变更。同时保存 JSON(机器可读)和纯文本(人类可读)。
+
+ ## 如何使用技能
+
+ 技能通过两种方式加载:
+
+ 1. **常驻加载** —— 技能正文直接出现在本系统提示词里、项目说明的上方。这些是当前阶段架构上必需的技能,把它们的内容当作权威指导。
+ 2. **可用 —— 调 consult_skill(name)** —— 在系统提示词的"可用方法论技能"清单里按名字 + 描述列出。当描述简介不够用时,调 `consult_skill("<名字>")` 把技能正文加载到你的对话历史里。
+
+ 技能正文是方法论本身。技能传达的是理念和决策框架。请根据具体业务场景灵活运用,不要机械照搬。
+
+ ## 与开发者用户的沟通
+
+ - **主动讨论**:规则粒度、准确率阈值、模型选择、边缘案例。
+ - **汇报进度**:每轮测试后,分享结果和下一步计划。
+ - **升级问题**:迭代后仍无法解决的问题,附带证据提交给开发者用户。
+ - **多问**:开发者用户是领域专家。对规则意图有疑问时,问他们。
+
+ ---
+
+
+ ## Per-project memory (you maintain this section)
+
+ The sections below are your scratchpad for this specific project. Update them as you learn about the business scenario, decisions, and edge cases. They persist across your sessions on this project.
+ ### Project
 
  <!-- What domain? What regulations? What documents? Fill this in during bootstrap. -->
 
- ## Decisions
+ ### Decisions
 
  <!-- Key decisions made with the developer user. Rule granularity, accuracy targets, model choices, scope boundaries. -->
 
- ## Domain Notes
+ ### Domain Notes
 
  <!-- Terminology, document formats, naming conventions, edge cases specific to this domain. -->
 
- ## User Preferences
+ ### User Preferences
 
  <!-- How the developer user prefers to communicate. Reporting format, language, level of detail. -->
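To make the two loading modes above concrete, a minimal sketch from the agent's side. The call shape `consult_skill("<name>")` and the skill names are taken from AGENT.md itself; nothing else is assumed:

```js
// Always-loaded skills need no call: their bodies are already inline in the
// system prompt for the current phase.

// On-demand skills are loaded by name when the one-line description in
// "Available Methodology Skills" isn't enough; the body lands in the
// conversation history.
consult_skill("rule-graph");                  // step 4: map rule relationships
consult_skill("cross-document-verification"); // step 8: multi-document cases
consult_skill("dashboard-reporting");         // step 10: generate the dashboard
```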
@@ -1,5 +1,6 @@
  ---
  name: auto-model-selection
+ tier: meta
  description: >
  Use Context7 CLI to get up-to-date LLM model information. Use whenever you need to
  know about available models, model capabilities, pricing, context window sizes, or

@@ -1,5 +1,6 @@
  ---
  name: bootstrap-workspace
+ tier: meta-meta
  description: Initialize and configure a document verification workspace. Use when a developer user first opens this workspace, when .env needs configuration, or when the business scenario needs to be understood. Guides the coding agent through reading regulation documents, understanding the developer user's business context, configuring model tiers and thresholds, and establishing the working relationship. Covers initial conversation with developer user to scope the verification task, set expectations, and agree on checkpoints.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: compliance-judgment
+ tier: meta
  description: Determine whether extracted entities comply with verification rules. Use after entity extraction to make the pass/fail judgment for each rule on each document. Covers translating natural language rules into executable logic, choosing between Python calculation and LLM semantic judgment, and producing actionable comments on failures. Also use when designing the judgment step of a workflow or when a rule's judgment logic needs debugging.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: confidence-system
+ tier: meta
  description: Design and calibrate confidence scoring for extraction and verification results. Use when building any workflow that needs to quantify trust in its output, when setting up quality control sampling thresholds, or when calibrating existing confidence scores against actual accuracy. Confidence is the bridge between workflows and quality control — high confidence means less review, low confidence means more review. Also use when the quality control skill reports that confidence scores do not correlate with actual correctness.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: corner-case-management
+ tier: meta
  description: Identify, catalog, and handle corner cases that do not fit the mainstream verification workflow. Use when the evolution loop classifies a failure as a corner case (affecting less than ~10% of documents), when adding a new edge case to the registry, or when deciding whether a corner case should be promoted to a systemic fix. Also use when designing the corner case detection mechanism for a workflow.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: cross-document-verification
+ tier: meta
  description: Perform case-level analysis across multiple documents for the same transaction. Use when documents do not exist in isolation — main contracts have appendices, loan applications come bundled with income certificates, bank statements, credit reports, and property appraisals. Use to build comparison matrices, detect contradictions (hard mismatches and soft implausibilities), classify severity, and flag fraud signals. Also use when user or end-user reports a cross-document inconsistency — these reports are ground truth and take priority over agent judgment.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: dashboard-reporting
+ tier: meta-meta
  description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants to see the system's status, or at any point where visual reporting would help communicate progress. Dashboards should be self-contained HTML files that can be opened by double-clicking. Also use when the developer user asks about results, accuracy, or system health.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: data-sensibility
+ tier: meta
  description: Build intuition about document data before writing extraction logic. Use before designing any extraction schema or regex pattern, when onboarding a new document type, or when extraction accuracy is unexpectedly low and you suspect a data assumption is wrong. Covers systematic observation of raw documents, spot-checking extracted results, distribution analysis, and recognizing suspicious patterns. If you are about to write code that touches document data and you have not read at least five documents end-to-end, stop and use this skill first.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: document-chunking
+ tier: meta
  description: >
  Fast, cheap chunking for processing batches of sample and input documents.
  Use when you need to split documents into manageable pieces for initial observation,

@@ -1,5 +1,6 @@
  ---
  name: document-parsing
+ tier: meta
  description: Parse source documents into machine-readable text with maximum fidelity. Use when processing any document in Samples/ or Input/ for the first time, when parsed text quality is poor, or when tables and charts need special handling. Covers multi-level parser selection from simple text extraction to OCR and vision models. Also use when a verification rule fails due to parsing issues (garbled text, missing tables, mangled layouts) and the parser needs to be upgraded for that document type.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: entity-extraction
+ tier: meta
  description: Extract specific entities, values, and text segments from documents as required by verification rules. Use after tree processing has located the relevant section, when a rule needs a specific number, date, name, amount, clause, or any domain-specific entity extracted. Covers extraction method selection (regex vs LLM), schema design, postprocessing, and confidence annotation. Also use when designing the extraction step of a workflow for worker LLMs.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: evolution-loop
+ tier: meta-meta
  description: Drive continuous improvement of skills and workflows through the diagnose-classify-fix-retest cycle. Use after any testing round reveals failures, when production quality control flags issues, or when accuracy drops below thresholds. Covers failure analysis, distinguishing systemic issues from corner cases, deciding whether to rewrite or patch, and knowing when to stop iterating. The evolution loop is the heartbeat of the system. Also use when transitioning between lifecycle phases (skill testing, workflow testing, production monitoring).
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: pdf-review-dashboard
+ tier: meta
  description: >
  Generate a two-column PDF review dashboard for manual verification result checking.
  Left panel shows the original PDF document, right panel shows verification results.

@@ -1,5 +1,6 @@
  ---
  name: quality-control
+ tier: meta-meta
  description: Design and execute quality control for production verification workflows. Use when workflows are deployed on Input/ documents and results need to be monitored, when designing the QC sampling strategy for a rule, or when evaluating whether monitoring can be reduced. Covers LLM-as-Judge evaluation, adaptive sampling strategies, confidence-based triage, and the transition from active monitoring to stable oversight. Also use when production quality drops and you need to diagnose whether to trigger the evolution loop.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: rule-extraction
+ tier: meta
  description: Extract and organize business verification rules from regulation documents into discrete, testable units. Use when processing documents in Rules/ to identify individual verification rules, when decomposing a regulation into atomic checks, or when the developer user adds new regulation files. Covers reading regulation text, identifying rule boundaries, determining granularity, handling cross-references, and producing a rule catalog. Also use when rules are provided in structured formats like xlsx or csv.
  ---
 
@@ -133,6 +134,65 @@ conversation or existing catalog. Therefore, when composing the brief:
  catalog.json.** rule_catalog uses workspace file locking;
  sandbox_exec bypasses it and races with other writers.
 
+ ## How to read regulation files (default: read whole)
+
+ Regulations are the audit's authoritative basis. Every `source_ref`
+ in your extracted rules must be verifiable against the source text.
+ For typical regulation documents (a single file under ~50 KB / under
+ ~100 pages), **read each regulation file whole using `workspace_file`
+ (operation=read) in a single call**:
+
+ ```js
+ workspace_file({ operation: "read", scope: "project", path: "Rules/01_some_regulation.md" })
+ ```
+
+ `workspace_file.read` is capped at 50,000 chars per call, which
+ covers virtually every individual regulation document. This is the
+ default. **Read every regulation file whole before you start
+ extracting rules from any of them.**
+
+ ### Tool choice — `workspace_file` vs `sandbox_exec`
+
+ | Tool | Per-call cap | Use for |
+ |---|---:|---|
+ | `workspace_file` (read) | 50,000 chars | **full reads of regulation / rule documents** |
+ | `sandbox_exec` (cat/head/etc) | 10,000 chars | shell commands, **not** full file reads |
+
+ `sandbox_exec` is designed for shell commands; its 10K cap is too
+ small for most regulations. `cat rules/01_*.md` returns only the
+ first ~10 KB followed by `\n[truncated]`. Re-issuing with `head -N` /
+ `tail -M` to scroll the window loses positional precision and burns
+ turns. **When you see truncation, don't fight the cap — switch
+ tools.**
+
+ ### Asymmetry — regs read whole, samples sampled
+
+ Regulations are limited (typically 1-10 files), authoritative, and
+ read once. Read every regulation whole.
+
+ Sample documents may number 30 to 1000+, are heterogeneous, and get
+ read many times during testing. **Don't try to read every sample
+ whole.** Use rule-applicability filters or sampled subsets to focus
+ attention.
+
+ ### Escape valve — when a single reg exceeds ~200K chars
+
+ Rare in practice. The largest regulation in `test_data_4` is 42 KB;
+ typical Chinese banking regs (资管新规, 信披办法, etc.) all fit
+ under 50 KB. But if you do encounter a single regulation so large
+ that reading it whole would crowd the context window — heuristic:
+ the file exceeds ~200,000 chars or ~25% of your context budget —
+ use your own judgment:
+
+ - Read by chapter (e.g., `第X章` / `Chapter X`) using `document_parse`
+ or paginated `workspace_file` reads
+ - Or build an in-workspace index file pointing to chapter offsets and
+ read on-demand per rule being extracted
+
+ The 50 KB cap is high enough that this almost never triggers. **The
+ default is read whole; deviate only when the file genuinely doesn't
+ fit.**
+
  ## Extraction Strategies
 
  ### Strategy 1: Structured Input (Developer User Provides Rules)
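A sketch of the read-whole default and the escape valve just described. The whole-file `workspace_file` read is confirmed above; the `offset`/`limit` parameters and the `content` return field are hypothetical stand-ins for whatever paginated reads the tool actually exposes:

```js
// Default: one call reads a typical regulation whole (50,000-char cap).
const reg = workspace_file({
  operation: "read",
  scope: "project",
  path: "Rules/01_some_regulation.md",
});

// Escape valve (rare, over ~200K chars): scan window by window, index the
// chapter offsets once, then re-read only the chapter the current rule needs.
const WINDOW = 50000;
const index = [];
for (let offset = 0; ; offset += WINDOW) {
  const page = workspace_file({
    operation: "read", scope: "project",
    path: "Rules/big_regulation.md",
    offset, limit: WINDOW, // hypothetical pagination parameters
  }).content;
  // Record each chapter heading ("第X章" / "Chapter N") with its offset.
  for (const m of page.matchAll(/^(第.{1,4}章|Chapter \d+).*$/gm)) {
    index.push({ title: m[0].trim(), offset: offset + m.index });
  }
  if (page.length < WINDOW) break; // last window reached
}
```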
@@ -1,5 +1,6 @@
  ---
  name: rule-graph
+ tier: meta-meta
  description: Build and maintain a graph of relationships between verification rules — shared entities, logical dependencies, and conflicts. Use when analyzing the impact of a regulation change, when optimizing extraction to avoid duplicate work, when checking rule catalog completeness, or when rolling up document-level results into a summary. Critical constraint — the graph is an overlay for analysis, NOT a prerequisite for execution. Every rule must remain independently runnable.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: skill-authoring
+ tier: meta
  description: Write each verification rule into a Claude Code skill folder following the official skill format. Use when converting extracted rules into skill folders, when iterating on existing rule skills after testing, or when the developer user wants to capture domain knowledge as a skill. Each skill folder must be self-contained with business logic in SKILL.md, code in scripts/, regulation context in references/, and sample data in assets/. Also use the bundled skill-creator for the full eval/iterate workflow.
  ---
 
@@ -1,6 +1,7 @@
  ---
  name: skill-creator
- description: Anthropic's skill-scaffolding toolkit — use for iterating/improving existing skills or running evals on them, NOT as the primary reference for building KC's per-rule verification skills. For KC rule skills, read `meta-meta/skill-authoring` first (canonical folder layout + granularity rules + KC-specific check.py entry-point conventions) and `meta-meta/work-decomposition` for ordering + grouping decisions. This skill applies once per-rule skills exist and the agent wants to optimize their description/triggering or run formal evals.
+ tier: meta
+ description: Anthropic's skill-scaffolding toolkit — use for iterating/improving existing skills or running evals on them, NOT as the primary reference for building KC's per-rule verification skills. For KC rule skills, consult `skill-authoring` first (canonical folder layout + granularity rules + KC-specific check.py entry-point conventions) and `work-decomposition` for ordering + grouping decisions. This skill applies once per-rule skills exist and the agent wants to optimize their description/triggering or run formal evals.
  ---
 
  # Skill Creator

@@ -1,5 +1,6 @@
  ---
  name: skill-to-workflow
+ tier: meta
  description: Distill a proven verification skill into a Python workflow with worker LLM prompts. Use when a rule skill has been tested and reaches the SKILL_ACCURACY threshold defined in .env. Covers the decision of what to implement as code vs LLM calls, prompt engineering for small context windows, model tier selection and progressive downgrade, and testing workflows against the coding agent's own results as ground truth. Also use when optimizing existing workflows for cost or speed.
  ---
 
@@ -49,10 +50,10 @@ Most rules are a mix: regex extracts the number, Python compares it to the thres
 
  Before declaring distillation complete, audit each rule's `verification_type` / `metric` / `evidence_type` (or equivalent fields in your catalog). For rules where the required verification is one of:
 
- - **Semantic** ("is this a positive guarantee or a disclaimer?")
- - **Contextual** ("interpret this in light of the document's product type")
- - **Counterfactual** ("what should this value be, given the other fields?")
- - **Cross-field arithmetic** ("does 期初 + 收益 - 分配 = 期末?")
+ - **Semantic** judgment
+ - **Contextual** interpretation
+ - **Counterfactual** reasoning
+ - **Cross-field arithmetic**
 
  regex alone rarely suffices. Three acceptable forms:
 
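For the cross-field arithmetic case above (the deleted bullet spelled it out as 期初 + 收益 - 分配 = 期末, i.e., opening + income - distributions = closing), the mixed form usually means code does the judgment once extraction has produced the numbers. A sketch in JS to match this file's other examples, though distilled workflows are Python per the skill description; the field names and tolerance are illustrative assumptions:

```js
// Cross-field arithmetic as code: opening + income - distributions should
// equal closing. Field names and the 0.01 tolerance are assumptions.
function checkRollforward({ opening, income, distributions, closing }) {
  const expected = opening + income - distributions;
  const pass = Math.abs(expected - closing) < 0.01;
  return {
    verdict: pass ? "pass" : "fail",
    comment: pass ? "" :
      `closing ${closing} != expected ${expected} (opening + income - distributions)`,
  };
}
```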
@@ -1,5 +1,6 @@
  ---
  name: task-decomposition
+ tier: meta-meta
  description: Decompose each verification rule into independent sub-tasks and assign the optimal method (rule, code, LLM, manual) to each. Use when converting extracted rules into implementation plans, when a rule skill is too expensive or inaccurate and needs restructuring, or when designing a multi-step verification pipeline. Covers MECE decomposition, method selection via the four-dimension decision matrix, cost-benefit analysis, and source tagging. Also use when auditing an existing workflow for cost optimization opportunities.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: tree-processing
+ tier: meta
  description: >
  Design production-grade document chunking mechanisms for verification workflows. Use when
  building the chunking step of a workflow that will run repeatedly on many documents.

@@ -1,5 +1,6 @@
  ---
  name: version-control
+ tier: meta
  description: Manage versioning of skills, workflows, prompts, and system configuration throughout the lifecycle. Use when skills are modified, workflows are regenerated, prompts are updated, or any artifact needs rollback capability. Covers what to version, how to version with file-system conventions, maintaining a version manifest, and rollback procedures. Also use when comparing performance between versions or when production results need to trace back to the exact workflow version that produced them.
  ---
 
@@ -1,5 +1,6 @@
  ---
  name: work-decomposition
+ tier: meta-meta
  description: Decide how to decompose the rule set into TaskBoard tasks during rule_extraction → skill_authoring transition. Covers ordering methodologies (difficulty-first / Shannon–Huffman, breadth-first, depth-first, binary partition), grouping rules (when to bundle multiple rules into one task vs. keep separate), three-axis difficulty estimation, and how to write PATTERNS.md project memory that stays useful across the run. Use when entering rule_extraction, when entering skill_authoring, or whenever the TaskBoard feels wrong and you want to re-decompose.
  ---
 
@@ -7,7 +8,7 @@ description: Decide how to decompose the rule set into TaskBoard tasks during ru
 
  KC's main agent is the conductor. The conductor decides what work to do next — and that decision is upstream of every other choice that follows. Wrong decomposition makes the rest of the run expensive: if rules are processed in the wrong order, the agent re-designs the same shape three times. If unrelated rules are bundled into one skill, the resulting check.py becomes the unified-runner anti-pattern from E2E #4. If related rules are split across separate skills, the agent re-derives the shared chunker logic 17 times.
 
- This skill is the conductor's playbook for that decision. It ships under `meta-meta/` because work decomposition is a system-level discipline, not a per-rule technique. The complementary `task-decomposition` skill (also under `meta-meta/`) covers the *internal* structure of one rule's check — locate, extract, normalize, judge, comment. This skill covers how the rule **set** should be split into TaskBoard items.
+ This skill is the conductor's playbook for that decision. It's tagged `tier: meta-meta` because work decomposition is a system-level discipline, not a per-rule technique. The complementary `task-decomposition` skill (also `tier: meta-meta`) covers the *internal* structure of one rule's check — locate, extract, normalize, judge, comment. This skill covers how the rule **set** should be split into TaskBoard items.
 
  ## When to use this skill
 
@@ -85,7 +86,7 @@ Bundle multiple rules into a single task (and a single check_r###_r###.py file)
  - The judgment logic for one rule is a substring or close variant of the next
  - A single failure typically implies multiple failures (you can't pass R013 if R015 fails)
 
- Example: R013 / R015 / R017 all check that a specific table on page 3 of the report contains certain mandatory fields. Same chunk, same parse, same verdict shape. Bundle as `check_r013_r015_r017.py` and create a single TaskCreate task `R013/R015/R017 — required-fields table`. The engine's filesystem-derived milestones recognize the grouped check.py and credit all three rule_ids.
+ Example: R013 / R015 / R017 all check that a specific table on page 3 of the report contains certain mandatory fields. Same chunk, same parse, same verdict shape. Bundle as `check_r013_r015_r017.py` and create a single task: `TaskCreate({id: "R013-R015-R017-skill_authoring", title: "R013/R015/R017 — required-fields table", phase: "skill_authoring"})`. The engine's filesystem-derived milestones recognize the grouped check.py and credit all three rule_ids.
 
  ### When to keep separate
 
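The crediting mentioned above is easy to picture. A sketch of how rule IDs might be recovered from a grouped check filename; the actual logic in `_milestone-derive.js` may well differ:

```js
// "check_r013_r015_r017.py" -> ["R013", "R015", "R017"], so one grouped
// file can credit several rules during milestone derivation (sketch only).
function ruleIdsFromCheckFile(filename) {
  const m = filename.match(/^check((?:_r\d{3})+)\.py$/i);
  if (!m) return [];
  return m[1].split("_").filter(Boolean).map((s) => s.toUpperCase());
}
```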
@@ -344,6 +345,40 @@ When entering skill_authoring with an empty TaskBoard:
  5. **Pick the first task.** Work it to completion (skill + check + at least one local test). Update PATTERNS.md with whatever you learned. Move to the next task.
  6. **At task ~5 and task ~10:** stop and re-read PATTERNS.md. If patterns suggest a refactor of earlier work, do it now (cheap) rather than later (expensive).
 
+ ### Calling TaskCreate / TaskUpdate / TaskComplete
+
+ The engine registers three task-board tools (v0.7.4):
+
+ - `TaskCreate({id, title, phase, ruleId?})` — adds a task to `tasks.json`. `id` must be unique within the session; pick a stable shape like `<rule_id>-<phase>` for per-rule tasks or `<group-name>-<phase>` for grouped / non-rule tasks. `phase` is the current phase the task belongs to. `ruleId` is optional — set it for per-rule tasks so the engine can credit the rule_id in milestone derivation.
+ - `TaskUpdate({id, status?, summary?})` — change a task's status (`pending` / `in_progress` / `completed` / `failed`), optionally with a short summary.
+ - `TaskComplete({id, summary?})` — sugar for `TaskUpdate({id, status:"completed", summary})`. Use this after finishing a unit of work.
+
+ ### Ralph loop scope — within a phase only
+
+ Important contract (changed in v0.7.4 after team feedback):
+
+ - **Loop scope = current phase only.** TaskCreate populates tasks for the CURRENT phase. The Ralph loop processes them one by one within the phase.
+ - **Loop exits at phase boundaries.** When all current-phase tasks complete OR the phase advances (you call `phase_advance`, or anything else changes `currentPhase`), the loop exits cleanly. Control returns to the user.
+ - **No engine auto-advance.** The engine does NOT auto-advance phases when tasks complete + exit criteria are met. Phase advance is YOUR explicit call (`phase_advance` tool) or the user's re-prompt.
+ - **Don't pre-create tasks for future phases.** They'll be ignored — the loop exits at the phase boundary before processing them. Create tasks only for the phase you're currently in.
+ - **Phase boundaries = user checkpoints.** This is intentional. The team needs visibility into progress at natural breakpoints. After your task batch + `phase_advance`, the loop exits, you summarize progress in your final message, the user prompts you to begin the next phase.
+
+ End-to-end autonomous "run from bootstrap to finalization without stopping" is NOT the engine's job — when that capability ships, it'll be an external driver (`/loop`-style command) that calls the agent repeatedly across phases. Inside one invocation, work the current phase fully, advance, and return to the user.
+
+ Examples:
+
+ ```
+ TaskCreate({ id: "R001-skill_authoring", title: "Author skill for R001",
+              phase: "skill_authoring", ruleId: "R001" })
+
+ TaskCreate({ id: "trust-bundle-skill_authoring",
+              title: "R013/R015/R017 — required-fields table",
+              phase: "skill_authoring" })
+
+ TaskComplete({ id: "R001-skill_authoring",
+                summary: "regex check passes 89/90; R001 done" })
+ ```
+
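Putting the contract and the examples together, one phase looks like this from the agent's side. A sketch: the tool calls mirror the examples above, and the argument shape of `phase_advance` is not specified here, so call it per the engine's tool schema:

```js
// Populate the CURRENT phase only, then work the tasks one by one.
TaskCreate({ id: "R001-skill_authoring", title: "Author skill for R001",
             phase: "skill_authoring", ruleId: "R001" });
TaskCreate({ id: "R002-skill_authoring", title: "Author skill for R002",
             phase: "skill_authoring", ruleId: "R002" });

// ... author, test, iterate ...
TaskComplete({ id: "R001-skill_authoring", summary: "check passes 89/90" });
TaskComplete({ id: "R002-skill_authoring", summary: "check passes 90/90" });

// All current-phase tasks done and exit criteria met: advance explicitly.
// The loop then exits at the phase boundary and control returns to the user,
// who prompts the next phase.
phase_advance();
```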
  ### Persisted methodology — PATTERNS.md OR phase logs OR AGENT.md decisions
 
  The principle: capture framework-level decisions to disk before each phase advance. The conversation will compact, agents will restart, the next phase will lose grounding. Whichever format you pick, write to disk — don't rely on conversation context that disappears.