kc-beta 0.8.1 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (63)
  1. package/package.json +1 -1
  2. package/src/agent/context.js +17 -1
  3. package/src/agent/engine.js +85 -8
  4. package/src/agent/llm-client.js +24 -1
  5. package/src/agent/pipelines/_milestone-derive.js +78 -7
  6. package/src/agent/pipelines/skill-authoring.js +19 -2
  7. package/src/agent/tools/release.js +94 -1
  8. package/src/cli/index.js +28 -7
  9. package/template/.env.template +1 -1
  10. package/template/AGENT.md +2 -2
  11. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  12. package/template/skills/en/bootstrap-workspace/SKILL.md +13 -0
  13. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  14. package/template/skills/en/confidence-system/SKILL.md +30 -8
  15. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  16. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  17. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  18. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  19. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  20. package/template/skills/en/document-chunking/SKILL.md +99 -15
  21. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  22. package/template/skills/en/quality-control/SKILL.md +14 -0
  23. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  24. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  25. package/template/skills/en/skill-authoring/SKILL.md +52 -8
  26. package/template/skills/en/skill-creator/SKILL.md +25 -3
  27. package/template/skills/en/skill-to-workflow/SKILL.md +23 -4
  28. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  29. package/template/skills/en/tree-processing/SKILL.md +1 -1
  30. package/template/skills/en/version-control/SKILL.md +15 -0
  31. package/template/skills/en/work-decomposition/SKILL.md +21 -35
  32. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  33. package/template/skills/zh/bootstrap-workspace/SKILL.md +13 -0
  34. package/template/skills/zh/compliance-judgment/SKILL.md +14 -0
  35. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  36. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  37. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  38. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  39. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  40. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  41. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  42. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  43. package/template/skills/zh/document-chunking/SKILL.md +96 -20
  44. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  45. package/template/skills/zh/entity-extraction/SKILL.md +14 -4
  46. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  47. package/template/skills/zh/quality-control/SKILL.md +14 -0
  48. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  49. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  50. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  51. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  52. package/template/skills/zh/skill-authoring/SKILL.md +108 -69
  53. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  54. package/template/skills/zh/skill-creator/SKILL.md +71 -61
  55. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  56. package/template/skills/zh/skill-to-workflow/SKILL.md +24 -5
  57. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  58. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  59. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  60. package/template/skills/zh/tree-processing/SKILL.md +1 -1
  61. package/template/skills/zh/version-control/SKILL.md +15 -0
  62. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  63. package/template/skills/zh/work-decomposition/SKILL.md +21 -33
@@ -1,40 +1,40 @@
- # Parser Catalog
+ # 解析器目录

- ## Text-Based Parsers (No LLM Required)
+ ## 文本类解析器(无需 LLM)

- | Parser | Type | Strengths | Limitations | Install |
+ | 解析器 | 类型 | 优势 | 局限 | 安装 |
 |--------|------|-----------|-------------|---------|
- | PyMuPDF (fitz) | Text extraction | Fast, reliable, basic structure | No table awareness, no OCR | `pip install pymupdf` |
- | pdfplumber | Layout-aware | Good table detection, spatial layout | Text-only, no OCR | `pip install pdfplumber` |
- | python-docx | DOCX parser | Native DOCX support, preserves structure | DOCX only | `pip install python-docx` |
- | openpyxl | XLSX parser | Full spreadsheet support | XLSX only | `pip install openpyxl` |
- | MarkItDown | Multi-format | Handles PDF, DOCX, PPTX, XLSX → markdown | Basic parsing, may miss complex layouts | `pip install markitdown` |
+ | PyMuPDF (fitz) | 文本抽取 | 快、稳定、基础结构识别 | 不识别表格、不支持 OCR | `pip install pymupdf` |
+ | pdfplumber | 版面感知 | 表格检测良好,保留空间布局 | 仅文本,不支持 OCR | `pip install pdfplumber` |
+ | python-docx | DOCX 解析器 | 原生支持 DOCX,保留结构 | 仅支持 DOCX | `pip install python-docx` |
+ | openpyxl | XLSX 解析器 | 完整支持电子表格 | 仅支持 XLSX | `pip install openpyxl` |
+ | MarkItDown | 多格式 | 处理 PDF、DOCX、PPTX、XLSX → markdown | 解析较基础,复杂版面可能丢失 | `pip install markitdown` |

- ## OCR / Vision Models (Via SiliconFlow API)
+ ## OCR / 视觉模型(通过 SiliconFlow API)

- | Model | Tier | Strengths | Best For |
+ | 模型 | 等级 | 优势 | 最适合 |
 |-------|------|-----------|----------|
- | zai-org/GLM-4.6V | OCR_TIER1 | Best accuracy, strong Chinese OCR | Complex tables, mixed layouts |
- | Qwen/Qwen3.5-397B-A17B | OCR_TIER2 | Good general vision, large model | Tables with context-dependent interpretation |
- | PaddlePaddle/PaddleOCR-VL-1.5 | OCR_TIER3 | Fast, lightweight | Standard text, simple tables |
+ | zai-org/GLM-4.6V | OCR_TIER1 | 准确率最高,中文 OCR 强 | 复杂表格、混合版面 |
+ | Qwen/Qwen3.5-397B-A17B | OCR_TIER2 | 通用视觉能力好,模型规模大 | 需要结合上下文理解的表格 |
+ | PaddlePaddle/PaddleOCR-VL-1.5 | OCR_TIER3 | 快、轻量 | 标准文本、简单表格 |

- ## Local Deployment Options
+ ## 本地部署选项

- For developer users who prefer local processing:
+ 适合偏好本地处理的开发者用户:

- | Tool | Type | Notes |
+ | 工具 | 类型 | 备注 |
 |------|------|-------|
- | PaddleOCR | Local OCR | Open source, supports Chinese/English |
- | Surya | Local OCR | Modern OCR with table detection |
- | pdf2md-local | PDF → Markdown | Reference: github.com/Ruilin-mmwa/pdf2md-local |
+ | PaddleOCR | 本地 OCR | 开源,支持中英文 |
+ | Surya | 本地 OCR | 现代 OCR,支持表格检测 |
+ | pdf2md-local | PDF → Markdown | 参考:github.com/Ruilin-mmwa/pdf2md-local |

- ## Selection Decision Tree
+ ## 选型决策树

 ```
- Is the PDF text-based (not scanned)?
- ├─ Yes → PyMuPDF or pdfplumber
- │   └─ Are tables parsed correctly?
- │       ├─ Yes → Done
- │       └─ No → Try pdfplumber → If still bad → Vision model on table regions
- └─ No (scanned) → OCR_TIER3 → If quality insufficient → OCR_TIER1
+ PDF 是文本型(非扫描件)吗?
+ ├─ 是 → PyMuPDF 或 pdfplumber
+ │   └─ 表格解析正确吗?
+ │       ├─ 是 → 完成
+ │       └─ 否 → 改用 pdfplumber → 仍不理想 → 对表格区域使用视觉模型
+ └─ 否(扫描件) → OCR_TIER3 → 质量不足 → OCR_TIER1
 ```
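The selection decision tree in this hunk maps to a few lines of Python. A minimal sketch only — the function name, flag arguments, and return labels are illustrative assumptions, not part of the package:

```python
def choose_parser(text_based: bool, tables_ok: bool = True,
                  ocr_quality_ok: bool = True) -> str:
    """Mirror the catalog's tree: text PDFs go to text parsers,
    scanned PDFs escalate through the OCR tiers."""
    if text_based:
        if tables_ok:
            return "pymupdf"  # fast path: PyMuPDF (or pdfplumber)
        # tables mis-parsed: retry with pdfplumber, then a vision
        # model on the table regions if still bad
        return "pdfplumber-or-vision"
    # scanned document: start lightweight, escalate on poor quality
    return "OCR_TIER3" if ocr_quality_ok else "OCR_TIER1"
```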
@@ -38,11 +38,9 @@ description: Extract specific entities, values, and text segments from documents

 ### 可用方法

- **正则 / Python** —— 成本:零。速度:即时。结果确定。
- 适用场景:日期、金额、百分比、标识符、固定短语、编号、电话、地址等任何格式可预测的值。任何能写出清晰格式约束的字段,都应该优先考虑正则。
+ **正则 / Python** —— 成本:零。速度:即时。结果确定。适用场景:日期、金额、百分比、标识符、固定短语、编号、电话、地址等任何格式可预测的值。任何能写出清晰格式约束的字段,都应该优先考虑正则。

- **Worker LLM** —— 成本:API tokens。速度:秒级。具备语义理解能力。
- 适用场景:需要结合上下文判断、条件性取值、语义匹配、结构模糊、识别误导性或暗示性表述、表格语义解读,凡是依赖理解而非模式匹配的任务。Worker LLM 在表面形式不稳定但语义清晰的场景下尤其有价值。
+ **Worker LLM** —— 成本:API tokens。速度:秒级。具备语义理解能力。适用场景:需要结合上下文判断、条件性取值、语义匹配、结构模糊、识别误导性或暗示性表述、表格语义解读,凡是依赖理解而非模式匹配的任务。Worker LLM 在表面形式不稳定但语义清晰的场景下尤其有价值。

 实际验证任务中存在大量需要语义理解的场景——"这段描述是否具有误导性?"、"该条款是否充分披露风险?"、"该担保人的业务描述是否与其所述行业一致?"、"产品类型表述与底层资产是否匹配?"——这些都不是正则能处理的问题。遇到此类任务,毫不犹豫地使用 worker LLM;不要为了节省 tokens 而把不适合的任务硬塞给正则,否则就是用低成本换高漏报或高误报,最终在审计或复核环节付出更大代价。

@@ -119,3 +117,15 @@ Schema 通常需要包含以下信息:
 3. 若目标章节超出可用上下文,回到树处理进一步收窄,或将其切分为多次调用。
 4. 始终为模型的响应预留足够空间,否则可能在生成中途被截断,导致 JSON 不完整。
 5. 用真实使用的模型做端到端测试以验证上下文确实能装下——编码 agent 估算的 token 数可能与 worker LLM 自己的分词器结果不一致,尤其是在中文、表格与代码混排的场景下,差异可能达到数十个百分点,仅凭估算容易在生产环境上线后才发现窗口被打爆。
+
+ ## 抽取也有边缘案例
+
+ 抽取**和判断同样重要**,对最终准确率的贡献不可低估。一个跨项目的经验:超过一半的最终错误其实可以追溯到抽取问题,而非判断问题 —— 抽取器返回了错值、错单位、或从错的章节取了内容,判官则忠实地从错的输入里得出了错的 verdict。
+
+ 把抽取按和判断同样的迭代纪律来做:
+
+ - **反思 / 迭代**:在样本集上跑过一次抽取器之后,回看失败的 case。是漏了某种模式(往 prompt 或正则里补)?是格式怪癖(单位换算、本地化)?还是文档类型问题(抽取器对 A 类对、对 B 类错)?
+ - **边缘案例登记**:当一个抽取失败没办法以合理代价改进标准抽取器时,把它登记到 `corner-case-management` 里 —— 注册表形状和判断边缘案例一样,resolution 类型换成 `code` / `prompt` / `parser` 级别的转换即可。
+ - **独立验证抽取器**:只在判断侧失败的端到端测试可能掩盖一个差劲的抽取器 —— 它的输出虽然不准,但碰巧 *大部分时候* 让判官得出了正确的 verdict。在 QC 复核里抽查的应当是抽取值本身,不只是最终 verdict。
+
+ 当你想通过调判官的 prompt 来提升准确率时,先检查抽取器是不是给判官喂了正确的输入。更便宜、更耐久的修复点几乎总在抽取器里。
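The split this hunk describes — regex for format-predictable values, worker LLM for semantic judgment — is easy to sketch on the regex side. A minimal illustration; the patterns and field names are assumptions, not the package's own code:

```python
import re

DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")          # ISO-style dates
AMOUNT = re.compile(r"(?:¥|\$)\s*([\d,]+(?:\.\d{1,2})?)")  # currency amounts

def extract_fields(text: str) -> dict:
    """Deterministic extraction for format-predictable values;
    anything requiring semantic judgment goes to the worker LLM."""
    return {
        "dates": ["-".join(m) for m in DATE.findall(text)],
        "amounts": [m.replace(",", "") for m in AMOUNT.findall(text)],
    }
```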
@@ -1,62 +1,62 @@
- # Convergence Guide
+ # 收敛诊断指南

- Diagnostic procedures and real-world data for understanding when the evolution loop is converging, stalling, or regressing.
+ 用于判断演化循环是在收敛、停滞还是回退的诊断流程与真实数据。

- ## Empirical Data
+ ## 实证数据

- ### The Shiji Project — Event Dating Reflection
+ ### 史记项目——历史事件日期回顾

- A document verification project for historical event dating across regulatory filings. Five rounds of evolution:
+ 一个针对监管报送文档进行历史事件日期核查的项目。共进行了五轮演化:

- - **Round 1**: 1,010 corrections (first pass — many extraction and judgment errors across the board).
- - **Round 2**: 431 corrections (systematic fixes applied — regex patterns, prompt refinements).
- - **Round 3**: 465 corrections (regression — round 2 fix for date normalization introduced new failures on edge-case date formats).
- - **Round 4**: 167 corrections (stabilizing — round 3 regression diagnosed and resolved, remaining issues are corner cases).
- - **Round 5**: 46 corrections (converged — below 5% threshold, no new patterns, no regressions).
+ - **第 1 轮**:1,010 项修正(首轮——抽取与判定环节都有大量错误)。
+ - **第 2 轮**:431 项修正(系统性修复——调整正则模式、改写提示词)。
+ - **第 3 轮**:465 项修正(回退——第 2 轮的日期标准化修复在边界日期格式上引发了新的失败)。
+ - **第 4 轮**:167 项修正(趋于稳定——第 3 轮的回退已诊断并解决,剩余问题为长尾边界情况)。
+ - **第 5 轮**:46 项修正(已收敛——低于 5% 阈值,无新模式,无回退)。

- **Key insight**: The round 3 spike was the most informative event. It revealed that the round 2 fix was too aggressive — it normalized dates that should not have been normalized. Without convergence tracking, this regression might have been masked by overall accuracy still improving on other cases.
+ **核心洞察**:第 3 轮的反弹是最有价值的事件。它揭示第 2 轮的修复过激——把本不该标准化的日期也强行标准化了。如果不做收敛追踪,这一回退可能被整体准确率的提升所掩盖。

- ## Diagnostic Flowchart
+ ## 诊断流程图

- ### If correction volume increases between iterations:
+ ### 如果两次迭代之间的修正数量增加:

- 1. **Check for regression**: Are previously passing cases now failing? If yes, the last fix is the likely cause. Compare the diff between iterations.
- 2. **Check for fix conflicts**: Does the new fix contradict a prior fix? For example, broadening a regex in round N that was narrowed in round N-1.
- 3. **Check for test set changes**: Did new documents enter the test set between iterations? New documents can inflate correction volume without indicating regression.
+ 1. **检查是否出现回退**:是否有原本通过的案例现在不通过了?若是,则上一次修复很可能是元凶。比对两次迭代之间的 diff。
+ 2. **检查修复之间是否冲突**:新的修复是否与之前的修复矛盾?例如第 N 轮放宽了在第 N-1 轮收紧的正则。
+ 3. **检查测试集是否变化**:两次迭代之间是否新增了文档?新文档可能在不代表回退的情况下抬高修正数量。

- ### If correction volume stays flat (not decreasing):
+ ### 如果修正数量持平(未下降):

- 1. **Check for oscillation**: Are the same cases flipping between pass and fail across iterations? This indicates the fix is unstable — it solves one variant but breaks another.
- 2. **Check if fix is too narrow**: The fix addresses the specific failing cases but does not generalize. The next iteration reveals similar cases the fix missed.
+ 1. **检查是否振荡**:是否有相同案例在不同迭代之间反复在通过与不通过之间切换?这说明修复不稳定——解决了一种变体却破坏了另一种。
+ 2. **检查修复是否过窄**:修复仅命中了具体失败的几条用例,没有泛化能力。下一次迭代会暴露出修复未覆盖的相似用例。

- ## False Convergence
+ ## 虚假收敛

- Metrics look stable but underlying issues are masked. The system appears converged but will fail on production data.
+ 指标看起来稳定,但底层问题被掩盖了。系统貌似已收敛,可一上生产数据就出问题。

- ### Common Causes
+ ### 常见成因

- - **Test set too small**: With fewer than 20 test cases, a single case changing can swing metrics by 5%. Convergence at this scale is statistically meaningless.
- - **Test set does not cover production variety**: The test set was curated from "clean" examples. Production documents include scanned PDFs, handwritten annotations, multi-language content, and formatting variations the test set never saw.
- - **Corner cases excluded from metrics**: If difficult cases are moved to `corner_cases.json` and excluded from accuracy calculation, the remaining "easy" cases converge quickly but the real problem is hidden.
+ - **测试集过小**:测试用例少于 20 条时,单条用例的状态切换就能让指标波动 5%。在这种规模下的收敛在统计上没有意义。
+ - **测试集未覆盖生产环境的多样性**:测试集是从"干净"样本中精选出来的。而生产文档包含扫描件 PDF、手写批注、多语种内容、各种格式变体——这些都未出现在测试集中。
+ - **难例被从指标中剔除**:如果将难例移入 `corner_cases.json` 并从准确率统计中排除,剩余的"简单"用例会很快收敛,但真正的问题被隐藏了。

- ### Detection
+ ### 检测方法

- Compare test set distribution to production distribution on key dimensions: document type, length, format, source. If they diverge significantly, convergence on the test set does not guarantee production quality.
+ 在文档类型、长度、格式、来源等关键维度上对比测试集与生产数据的分布。若两者偏差显著,测试集上的收敛并不能保证生产质量。

- ## Estimating Remaining Rounds
+ ## 估计剩余轮次

- ### Simple Heuristic
+ ### 简单启发式

- If corrections approximately halve each round, expect `log2(current_corrections / threshold)` more rounds.
+ 如果每轮修正数大致减半,预期还需要 `log2(当前修正数 / 阈值)` 轮。

- Example: current round has 200 corrections, threshold is 5% of 1000 cases = 50 corrections.
- - Estimated remaining rounds: log2(200/50) = log2(4) = 2 rounds.
+ 示例:当前轮有 200 项修正,阈值为 1000 个案例的 5% = 50 项修正。
+ - 预计剩余轮次:log2(200/50) = log2(4) = 2 轮。

- ### When the Heuristic Fails
+ ### 启发式失效时

- If corrections do not halve between rounds, the current approach may have hit its ceiling. Consider:
- - Escalating the fix strategy (prompt tweak → logic rewrite → architecture change).
- - Expanding the test set to reveal hidden patterns.
- - Consulting the developer user for domain insight on stubborn failures.
+ 如果每轮修正数没有减半,意味着当前方法可能已经触及天花板。可考虑:
+ - 升级修复策略(提示词微调 → 逻辑重写 → 架构调整)。
+ - 扩充测试集以暴露隐藏模式。
+ - 就顽固失败案例向开发者用户咨询领域见解。

- Do not grind through more iterations expecting different results. If three consecutive rounds show similar correction volumes, stop and reassess.
+ 不要寄望于继续硬磨多轮能换来不同结果。如果连续三轮的修正数都接近,停下来重新评估。
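The halving heuristic in this hunk is a one-line computation. A sketch using the guide's own worked numbers (the function name is illustrative):

```python
import math

def estimated_rounds(current_corrections: int, threshold: int) -> int:
    """If corrections roughly halve each round, this many rounds remain
    before the correction count drops to the threshold."""
    if current_corrections <= threshold:
        return 0  # already converged
    return math.ceil(math.log2(current_corrections / threshold))
```

With 200 corrections against a threshold of 50 (5% of 1000 cases), this returns log2(4) = 2, matching the example in the file.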
@@ -12,6 +12,20 @@ description: Design and execute quality control for production verification work

 质量监控的角色是「观察员」:用最少的复查量,维持对系统准确率的信心。当信心下降时,立即拉响警报、触发演化循环。

+ ## 与其他 skill 的协作
+
+ 质量监控是一组紧密协作的 skill 中的一员。不要把兄弟 skill 的内容搬过来在这里复述 —— 引用它即可。同一阶段同时加载的 skill 对 conductor 已经可见,在本 skill 里再注入一遍它们的材料,只会把两边都撑胖。
+
+ 各自的关系:
+
+ - `confidence-system` 定义置信度怎么合成、怎么校准。当 QC 用置信度来分流"哪些结果需要更多复核"时,它**消费**置信度 —— 但置信度的设计归在那边。
+ - `evolution-loop` 是把 QC 发现转化为改进的闭环机器。QC 产出信号(失败、漂移、反复出现的模式);evolution-loop 决定怎么处理这些信号。
+ - `corner-case-management` 是 QC 发现的异常的归宿。QC 揭示"这一份没合上";corner-case-management 判断它是该登记为边缘案例、还是系统性问题该上升到主流程、或者是数据质量问题需要升级。
+ - `cross-document-verification` 是另一类规则。QC 的工作是核查那类规则是否按设计在执行,而不是再讲一遍怎么构建它们。
+ - `dashboard-reporting` 是 QC 结果向开发者用户呈现的地方。QC 产数据,dashboard 来渲染。
+
+ 写作意义:如果你发现自己在本文件里写的东西更自然地归属于上面某个 skill,就在这里留一句指向("置信度的合成见 `confidence-system`"),把深度留在该去的地方。conductor 需要细节时,那个 skill 已经为它加载了。
+
 ## 五层质量保障架构

 质量控制不是单一活动——它由五个层级构成,逐层递进。低层级必须通过后,高层级才会执行。
@@ -1,92 +1,92 @@
- # QA Layer Specifications
+ # QA 层级规格

- Detailed specifications for the five-layer QA architecture. Each layer builds on the one below it.
+ 五层质量保障架构的详细规格。每一层都建立在下一层之上。

- ## Layer Details
+ ## 层级细节

- ### L1: Text Integrity
+ ### L1:文本完整性

- - **Description**: Verify that source files exist, are readable, and that text content is preserved correctly after any processing (parsing, OCR, conversion).
- - **Input**: Raw document files and their processed text output.
- - **Output**: Pass/fail per file with error details.
- - **Example checks**: File exists and is non-empty. Encoding is UTF-8 (or declared encoding). No null bytes in text output. Character count is within expected range for document type.
- - **Common failures**: File path changed after processing. OCR produced empty output. Encoding mismatch causes garbled characters.
- - **Escalation**: If L1 fails, do not proceed to higher layers. Log the failure and flag for reprocessing.
+ - **描述**:核查源文件存在、可读,且在经过任何处理(解析、OCR、转换)之后文本内容保持正确。
+ - **输入**:原始文档文件及其处理后的文本输出。
+ - **输出**:每个文件一份通过/不通过结论,附错误细节。
+ - **示例检查**:文件存在且非空;编码为 UTF-8(或声明的编码);文本输出中无空字节;字符数在该文档类型的预期范围内。
+ - **常见失败**:处理后文件路径变了;OCR 输出为空;编码不匹配导致乱码。
+ - **升级处理**:若 L1 不通过,不要进入更高层级。记录失败原因并标记需重新处理。

- ### L2: Syntax
+ ### L2:语法

- - **Description**: Verify that output files conform to their declared format and schema.
- - **Input**: Output files (JSON, CSV, etc.) from workflows.
- - **Output**: Pass/fail per file with parse errors or schema violations.
- - **Example checks**: JSON is valid (parses without error). Required top-level keys exist. Array fields are arrays, not strings. Date fields match ISO 8601 format.
- - **Common failures**: Trailing comma in JSON. Missing closing bracket. CSV with inconsistent column count. Unexpected null where value is required.
- - **Escalation**: Syntax failures indicate a bug in the output generation code. Fix the code, not the data.
+ - **描述**:核查输出文件是否符合声明的格式和 schema。
+ - **输入**:工作流产出的输出文件(JSON、CSV 等)。
+ - **输出**:每个文件一份通过/不通过结论,附解析错误或 schema 违例。
+ - **示例检查**:JSON 合法(解析无错);必填的顶层键存在;数组字段确为数组而非字符串;日期字段符合 ISO 8601 格式。
+ - **常见失败**:JSON 末尾多余的逗号;缺少右括号;CSV 列数不一致;本应有值的字段出现意外的 null。
+ - **升级处理**:语法失败说明输出生成代码存在 bug。改代码,不要改数据。

- ### L3: Data Completeness
+ ### L3:数据完备性

- - **Description**: Verify that required data fields are populated with values in their valid domain.
- - **Input**: Parsed output records.
- - **Output**: Per-field validation results with reasons for any failures.
- - **Example checks**: Invoice date is a valid date (not "N/A" or empty). Amount is a positive number. Entity name is non-empty and does not contain only whitespace. Enum fields contain allowed values.
- - **Common failures**: Extraction returned "unable to determine" as a value. Amount includes currency symbol (string instead of number). Date extracted as partial (month and day but no year).
- - **Escalation**: Completeness failures feed back to extraction prompt improvement. If a field is consistently incomplete, the extraction logic needs work.
+ - **描述**:核查必填数据字段已被填充且取值在有效域内。
+ - **输入**:解析后的输出记录。
+ - **输出**:每个字段一份校验结果,失败时附原因。
+ - **示例检查**:发票日期是合法日期(不是 "N/A" 或空);金额是正数;实体名称非空且不仅包含空白;枚举字段取值在允许范围内。
+ - **常见失败**:抽取返回了 "unable to determine" 作为值;金额包含币种符号(字符串而非数字);日期抽取不完整(有月日但没有年)。
+ - **升级处理**:完备性失败应反馈到抽取提示词的改进中。若某字段持续不完整,则抽取逻辑需要打磨。

- ### L4: Business Logic
+ ### L4:业务逻辑

- - **Description**: Verify cross-field consistency and compliance with business rules.
- - **Input**: Complete, validated records from L3.
- - **Output**: Per-rule validation results with reasoning.
- - **Example checks**: Contract end date is after start date. Invoice date falls within contract validity period. Total amount equals sum of line items. Signatory name matches authorized personnel list.
- - **Common failures**: Date comparison fails due to timezone differences. Rounding errors in amount calculations. Cross-reference lookup fails because entity names differ slightly (e.g., "ABC Corp" vs "ABC Corporation").
- - **Escalation**: Business logic failures may indicate rule misunderstanding. Consult the developer user if the rule intent is ambiguous.
+ - **描述**:核查跨字段一致性及对业务规则的合规性。
+ - **输入**:L3 中已完整且通过校验的记录。
+ - **输出**:每条规则一份校验结果,附推理过程。
+ - **示例检查**:合同结束日期晚于开始日期;发票日期落在合同有效期内;总金额等于明细项之和;签约人姓名匹配授权人员名单。
+ - **常见失败**:日期比较因时区差异而失败;金额计算出现舍入误差;交叉引用查找因实体名称细微差异(如 "ABC Corp" vs "ABC Corporation")而失败。
+ - **升级处理**:业务逻辑失败可能意味着对规则理解有误。如果规则意图含糊,应咨询开发者用户。

- ### L5: Cross-Phase
+ ### L5:跨阶段

- - **Description**: Verify consistency across different phases of the verification pipeline.
- - **Input**: Outputs from multiple pipeline stages (extraction, verification, reporting).
- - **Output**: Cross-phase consistency report.
- - **Example checks**: Entities in final results match those in extraction output (nothing added or dropped). Rule IDs in results exist in the rule catalog. Workflow output for a skill matches the skill's own ground truth output. Confidence scores in results match those computed by the confidence system.
- - **Common failures**: A rule was added to the catalog but the workflow was not updated to include it. Extraction found 5 entities but results only report 4. Workflow output diverges from skill ground truth on edge cases.
- - **Escalation**: Cross-phase failures often indicate integration issues. Check the pipeline connections, not individual components.
+ - **描述**:核查核查流水线不同阶段之间的一致性。
+ - **输入**:流水线多个阶段的输出(抽取、核查、报告)。
+ - **输出**:跨阶段一致性报告。
+ - **示例检查**:最终结果中的实体与抽取输出一致(没有新增也没有遗漏);结果中的规则 ID 存在于规则目录中;技能对应的工作流输出与该技能自身的基准真值一致;结果中的置信度分数与置信度系统所计算的一致。
+ - **常见失败**:规则被加入目录但工作流未同步更新;抽取找到 5 个实体而结果中只报告 4 个;工作流输出在边界情况上与技能基准真值出现分歧。
+ - **升级处理**:跨阶段失败通常意味着集成问题。检查流水线的连接,而非单个组件。

- ## Script Naming Convention
+ ## 脚本命名规范

- | Prefix | Layer | Purpose | Examples |
+ | 前缀 | 层级 | 用途 | 示例 |
 |--------|-------|---------|----------|
- | `lint_` | L1-L2 | Fast, syntactic checks | `lint_json.py`, `lint_encoding.py`, `lint_schema.py` |
- | `validate_` | L3-L4 | Domain and logic validation | `validate_fields.py`, `validate_dates.py`, `validate_amounts.py` |
- | `cross_validate_` | L5 | Cross-phase consistency | `cross_validate_extraction.py`, `cross_validate_rules.py` |
+ | `lint_` | L1-L2 | 快速的语法层检查 | `lint_json.py`、`lint_encoding.py`、`lint_schema.py` |
+ | `validate_` | L3-L4 | 领域与逻辑校验 | `validate_fields.py`、`validate_dates.py`、`validate_amounts.py` |
+ | `cross_validate_` | L5 | 跨阶段一致性 | `cross_validate_extraction.py`、`cross_validate_rules.py` |

- Scripts should:
- - Accept a file or directory path as input.
- - Output structured JSON results (pass/fail per check, with reasons).
- - Return exit code 0 if all checks pass, non-zero otherwise.
- - Be idempotent — running twice produces the same result.
+ 脚本应当:
+ - 接受文件或目录路径作为输入。
+ - 输出结构化的 JSON 结果(每项检查的通过/不通过及原因)。
+ - 全部检查通过时退出码为 0,否则非零。
+ - 幂等——多次运行结果一致。

- ## QC vs Reflection
+ ## 质控 vs 反思

- | Dimension | QC (this skill) | Reflection (evolution-loop) |
+ | 维度 | 质控(本技能) | 反思(evolution-loop) |
 |-----------|-----------------|---------------------------|
- | **Who runs it** | Coding agent or automated scripts | Coding agent |
- | **What triggers it** | Every batch, on schedule | QC failures, accuracy drops |
- | **Input** | Workflow outputs | QC reports, failure logs, iteration history |
- | **Output** | Pass/fail verdicts, accuracy metrics | Root cause diagnosis, fix proposals |
- | **Cost** | Low (mostly scripts, some LLM at L4-L5) | Higher (deep analysis, prompt rewriting) |
- | **When to use** | Always — every production batch | Only when QC reveals problems |
- | **Goal** | Detect problems | Fix problems |
+ | **谁来运行** | 编程智能体或自动化脚本 | 编程智能体 |
+ | **触发条件** | 每批次、按时调度 | 质控失败、准确率下降 |
+ | **输入** | 工作流的输出 | 质控报告、失败日志、迭代历史 |
+ | **输出** | 通过/不通过结论、准确率指标 | 根因诊断、修复方案 |
+ | **成本** | 低(多为脚本,L4-L5 涉及部分 LLM) | 较高(深度分析、提示词改写) |
+ | **使用时机** | 始终运行——每个生产批次都跑 | 仅在质控发现问题时启动 |
+ | **目标** | 发现问题 | 修复问题 |

- QC without Reflection detects issues but cannot fix them. Reflection without QC has no data to work from. They are complementary, not alternatives.
+ 只做质控而不反思,能发现问题但无法修复;只做反思而不做质控,则没有可供分析的数据。两者互补,并非替代关系。

- ## Integration Points
+ ## 集成点

- ### With `data-sensibility`
+ ### 与 `data-sensibility`

- The `data-sensibility` skill provides input validation that feeds L1-L3. If data-sensibility checks flag a document as anomalous before processing, QC can prioritize reviewing that document's outputs. Data-sensibility operates on inputs; QC operates on outputs. Together they bracket the pipeline.
+ `data-sensibility` 技能提供输入侧的校验,为 L1-L3 喂数据。如果 data-sensibility 在处理前就将某文档标记为异常,质控可以优先复核该文档的输出。data-sensibility 关注输入;质控关注输出。两者首尾呼应,把整条流水线夹在中间。

- ### With `cross-document-verification`
+ ### 与 `cross-document-verification`

- Cross-document verification enables L5 cross-doc consistency checks. When multiple documents reference the same entity (e.g., same contract number across invoice and purchase order), L5 can verify that extracted values are consistent across documents. Without cross-document verification, L5 is limited to single-document cross-phase checks.
+ 跨文档核查使 L5 的跨文档一致性检查成为可能。当多个文档引用同一实体(例如发票和采购订单中的同一合同号),L5 可以核查跨文档抽取值是否一致。没有跨文档核查时,L5 仅能进行单文档内的跨阶段检查。

- ### With `confidence-system`
+ ### 与 `confidence-system`

- QC results calibrate the confidence system. When QC reveals that high-confidence results are sometimes wrong, the confidence thresholds need adjustment. Conversely, confidence scores drive QC sampling — low-confidence results get more review. This creates a feedback loop: QC improves confidence calibration, better calibration improves QC efficiency.
+ 质控结果用来校准置信度系统。当质控发现高置信度结果有时是错的,就需要调整置信度阈值。反过来,置信度分数也驱动质控抽样——低置信度结果获得更多复核。这形成一个反馈环:质控改进置信度校准,更好的校准又提升质控效率。
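The script contract in this hunk (path in, structured JSON out, exit code 0 only when every check passes, idempotent) might look like the following for a `lint_json.py`-style L2 check. The report keys and function names are illustrative assumptions, not the package's actual scripts:

```python
import json
from pathlib import Path

def lint_json(path: str) -> dict:
    """L2 syntax lint: every .json file under `path` must parse.
    Pure read-only checks, so repeated runs give the same result."""
    target = Path(path)
    files = [target] if target.is_file() else sorted(target.rglob("*.json"))
    checks = []
    for f in files:
        try:
            json.loads(f.read_text(encoding="utf-8"))
            checks.append({"file": str(f), "pass": True, "reason": None})
        except (ValueError, UnicodeDecodeError) as exc:
            checks.append({"file": str(f), "pass": False, "reason": str(exc)})
    return {"checks": checks, "pass": all(c["pass"] for c in checks)}

def main(argv: list) -> int:
    report = lint_json(argv[0])
    print(json.dumps(report, ensure_ascii=False))  # structured JSON per check
    return 0 if report["pass"] else 1              # 0 only when all pass
```

A real script would wire this up with `sys.exit(main(sys.argv[1:]))` under a `__main__` guard.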
@@ -1,76 +1,76 @@
- # Sampling Strategies for Quality Control
+ # 质控抽样策略

- ## Adaptive Sampling
+ ## 自适应抽样

- The core idea: review more when you are uncertain, less when you are confident. Confidence grows with evidence — consecutive batches of high accuracy.
+ 核心思想:不确定时多复核,有把握时少复核。把握随证据增长——也就是连续多个批次的高准确率。

- ### Continuous Decay Model
+ ### 连续衰减模型

- Rather than cliff-edge transitions between phases, use a smooth exponential decay driven by observed accuracy:
+ 不要在阶段之间做悬崖式切换,而是用一条由实测准确率驱动的平滑指数衰减曲线:

 ```
 sampling_rate = max(floor_rate, exp(-λ × consecutive_successes))
 ```

- Where:
- - `consecutive_successes`: number of consecutive batches where accuracy meets or exceeds the threshold. **Resets to 0** whenever a batch's accuracy drops below the threshold. This is the self-correcting mechanism — quality drops immediately increase monitoring.
- - `λ` (decay speed): controlled by MONITOR_FREQUENCY in `.env`.
- - `floor_rate`: the minimum sampling rate, never goes below this.
+ 其中:
+ - `consecutive_successes`:准确率达到或超过阈值的连续批次数。**任何一个批次的准确率跌破阈值,立即重置为 0**。这是系统的自纠正机制——质量一旦下滑就立即提高监控频率。
+ - `λ`(衰减速度):由 `.env` 中的 MONITOR_FREQUENCY 控制。
+ - `floor_rate`:抽样率的下限,永远不低于此值。

- ### MONITOR_FREQUENCY Mapping
+ ### MONITOR_FREQUENCY 映射

- | Setting | λ | floor_rate | Character |
+ | 设置 | λ | floor_rate | 风格 |
 |---------|---|------------|-----------|
- | `high` | 0.1 | 0.10 | Slow decay, cautious — for high-stakes verification where errors are costly |
- | `mid` | 0.2 | 0.05 | Balanced decay — standard for most scenarios |
- | `low` | 0.3 | 0.05 | Fast decay — for well-understood domains with simple rules |
+ | `high` | 0.1 | 0.10 | 衰减慢,谨慎——适用于高风险核查,错漏代价大 |
+ | `mid` | 0.2 | 0.05 | 平衡衰减——多数场景的标准设置 |
+ | `low` | 0.3 | 0.05 | 衰减快——适用于规则简单、域知识成熟的场景 |

- As a rough mental model of the curve shape (for `mid`):
- - After 1 success: ~82% sampling
- - After 3 successes: ~55%
- - After 5 successes: ~37%
- - After 10 successes: ~14%
- - After 15 successes: ~5% (floor)
+ 以下是该曲线形状的粗略心理模型(`mid` 配置下):
+ - 连续 1 次成功后:约 82% 抽样
+ - 连续 3 次成功后:约 55%
+ - 连续 5 次成功后:约 37%
+ - 连续 10 次成功后:约 14%
+ - 连续 15 次成功后:约 5%(下限)

- These numbers, the formula, and even the exponential shape are recommended defaults. The coding agent and developer user should discuss and calibrate based on the specific business scenario. If a different decay function (linear, sigmoid, or hand-tuned) works better, use it. The framework — accuracy-driven decay with reset on quality drop — matters more than the specific formula.
+ 这些数字、这个公式、乃至指数形状本身,都是推荐默认值。编程智能体应与开发者用户讨论后,根据具体业务场景做校准。如果其他衰减函数(线性、sigmoid 或人工调好的曲线)更合适,就用它。重要的是框架——"由准确率驱动、质量下滑立即重置"——而不是某条具体公式。

- ## Priority Sampling
+ ## 优先级抽样

- Not all results are equally worth reviewing. Priority sampling ensures that the most informative results are always in the review set:
+ 不是所有结果都同样值得复核。优先级抽样确保信息量最高的结果始终进入复核集合:

- ### Always Review
- - Results where the workflow reported low confidence (below the full-review threshold from `confidence-system`).
- - Results where the workflow produced an error or missing result.
- - Results from document types not seen during skill/workflow testing.
+ ### 必须复核
+ - 工作流自报置信度偏低的结果(低于 `confidence-system` 中的全量复核阈值)。
+ - 工作流报错或结果缺失的条目。
+ - 来自技能/工作流测试中未出现过的文档类型的结果。

- ### Usually Review
- - Results where the workflow's confidence is in the medium band.
- - Results from rules that historically have lower accuracy.
- - Results from the first occurrence of a new document format or variant.
+ ### 通常复核
+ - 工作流置信度处于中段的结果。
+ - 历史准确率较低的规则产出的结果。
+ - 新文档格式或变体首次出现时的结果。

- ### Spot-Check
- - Results with high confidence from rules that historically have high accuracy.
- - These are selected randomly from the high-confidence pool.
- - The purpose is regression detection, not active improvement.
+ ### 抽查
+ - 来自历史准确率高的规则、且置信度高的结果。
+ - 从高置信度池中随机挑选。
+ - 目的在于回归检测,不在于主动改进。

- ## Stratified Sampling
+ ## 分层抽样

- When documents vary significantly in complexity or type, stratify the sample:
+ 当文档在复杂度或类型上差异显著时,对样本进行分层:

- 1. **Group documents** by type, complexity, or any relevant characteristic.
- 2. **Sample proportionally** from each group, ensuring that minority groups are represented.
- 3. **Over-sample** from groups that historically have lower accuracy.
+ 1. **分组**:按文档类型、复杂度或任何相关特征划分。
+ 2. **按比例抽样**:从每个分组按比例抽取,确保少数派分组也有代表。
+ 3. **过采样**:对历史准确率较低的分组提高采样比例。

- This prevents the random sample from being dominated by easy documents while missing systematic failures in hard documents.
+ 这样可以防止随机样本被简单文档主导,从而错过难文档中的系统性失败。

- ## Confidence Calibration Check
+ ## 置信度校准检查

- Periodically (every N batches), run a calibration check:
+ 每 N 个批次定期做一次校准检查:

- 1. Take a random sample of high-confidence results.
- 2. Review them (LLM-as-Judge or human).
- 3. Compare: are 90%+ of "high confidence" results actually correct?
- 4. If not, the confidence system needs recalibration (see `confidence-system` skill).
- 5. If yes, you can safely reduce the sampling rate for high-confidence results.
+ 1. 从高置信度结果中随机抽取样本。
+ 2. 复核(用 LLM-as-Judge 或人工)。
+ 3. 比对:是否 90%+ "高置信度"结果确实是正确的?
+ 4. 若不是,置信度系统需要重新校准(参见 `confidence-system` 技能)。
+ 5. 若是,则可放心降低高置信度结果的抽样率。

- This is a meta-check on the quality of the quality control system itself.
+ 这是对质控系统本身质量的一次元层级检查。
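The continuous decay model at the top of this hunk can be written directly from its formula; λ and floor_rate follow the MONITOR_FREQUENCY table, and (as the spec notes) resetting `consecutive_successes` to 0 on a below-threshold batch is the caller's job. A minimal sketch, not the package's implementation:

```python
import math

# (λ, floor_rate) per MONITOR_FREQUENCY setting, from the mapping table
PRESETS = {"high": (0.1, 0.10), "mid": (0.2, 0.05), "low": (0.3, 0.05)}

def sampling_rate(consecutive_successes: int, frequency: str = "mid") -> float:
    """sampling_rate = max(floor_rate, exp(-λ × consecutive_successes))."""
    lam, floor_rate = PRESETS[frequency]
    return max(floor_rate, math.exp(-lam * consecutive_successes))
```

With `mid`, this reproduces the curve described above: about 0.82 after 1 success, 0.37 after 5, and the 0.05 floor after 15.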