kc-beta 0.7.5 → 0.8.3

Files changed (81)
  1. package/README.md +47 -0
  2. package/package.json +3 -2
  3. package/src/agent/context.js +17 -1
  4. package/src/agent/engine.js +467 -100
  5. package/src/agent/llm-client.js +24 -1
  6. package/src/agent/pipelines/_advance-hints.js +92 -0
  7. package/src/agent/pipelines/_milestone-derive.js +325 -20
  8. package/src/agent/pipelines/skill-authoring.js +49 -3
  9. package/src/agent/tools/agent-tool.js +2 -2
  10. package/src/agent/tools/consult-skill.js +15 -0
  11. package/src/agent/tools/dashboard-render.js +48 -1
  12. package/src/agent/tools/document-parse.js +31 -2
  13. package/src/agent/tools/phase-advance.js +17 -13
  14. package/src/agent/tools/release.js +343 -7
  15. package/src/agent/tools/sandbox-exec.js +65 -8
  16. package/src/agent/tools/worker-llm-call.js +95 -15
  17. package/src/agent/workspace.js +25 -4
  18. package/src/cli/components.js +4 -1
  19. package/src/cli/index.js +125 -8
  20. package/src/config.js +19 -2
  21. package/src/marathon/driver.js +217 -0
  22. package/src/marathon/prompts.js +93 -0
  23. package/template/.env.template +17 -1
  24. package/template/AGENT.md +2 -2
  25. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  26. package/template/skills/en/bootstrap-workspace/SKILL.md +27 -0
  27. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  28. package/template/skills/en/confidence-system/SKILL.md +30 -8
  29. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  30. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  31. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  32. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  33. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  34. package/template/skills/en/document-chunking/SKILL.md +99 -15
  35. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  36. package/template/skills/en/quality-control/SKILL.md +23 -0
  37. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  38. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  39. package/template/skills/en/skill-authoring/SKILL.md +85 -2
  40. package/template/skills/en/skill-creator/SKILL.md +25 -3
  41. package/template/skills/en/skill-to-workflow/SKILL.md +73 -1
  42. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  43. package/template/skills/en/tree-processing/SKILL.md +1 -1
  44. package/template/skills/en/version-control/SKILL.md +15 -0
  45. package/template/skills/en/work-decomposition/SKILL.md +52 -32
  46. package/template/skills/phase_skills.yaml +5 -0
  47. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  48. package/template/skills/zh/bootstrap-workspace/SKILL.md +27 -0
  49. package/template/skills/zh/compliance-judgment/SKILL.md +51 -37
  50. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  51. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  52. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  53. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  54. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  55. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  56. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  57. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  58. package/template/skills/zh/document-chunking/SKILL.md +101 -18
  59. package/template/skills/zh/document-parsing/SKILL.md +65 -65
  60. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  61. package/template/skills/zh/entity-extraction/SKILL.md +78 -68
  62. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  63. package/template/skills/zh/quality-control/SKILL.md +23 -0
  64. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  65. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  66. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  67. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  68. package/template/skills/zh/skill-authoring/SKILL.md +136 -58
  69. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  70. package/template/skills/zh/skill-creator/SKILL.md +215 -201
  71. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  72. package/template/skills/zh/skill-to-workflow/SKILL.md +73 -1
  73. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  74. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  75. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  76. package/template/skills/zh/tree-processing/SKILL.md +67 -63
  77. package/template/skills/zh/version-control/SKILL.md +15 -0
  78. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  79. package/template/skills/zh/work-decomposition/SKILL.md +52 -30
  80. package/template/workflows/common/llm_client.py +168 -0
  81. package/template/workflows/common/utils.py +132 -0
@@ -4,99 +4,99 @@ tier: meta
4
4
  description: Parse source documents into machine-readable text with maximum fidelity. Use when processing any document in Samples/ or Input/ for the first time, when parsed text quality is poor, or when tables and charts need special handling. Covers multi-level parser selection from simple text extraction to OCR and vision models. Also use when a verification rule fails due to parsing issues (garbled text, missing tables, mangled layouts) and the parser needs to be upgraded for that document type.
5
5
  ---
6
6
 
7
- # Document Parsing
7
+ # 文档解析
8
8
 
9
- Parsing is the foundation. If the text is wrong, everything downstream is wrong. But parsing is also a cost center — do not use expensive vision models when simple text extraction works.
9
+ 解析是整个工作流的根基。一旦文本提取错了,下游的所有规则判断、对账、合规核查都会跟着一起错,而且这种错误往往很难在后续环节里被发现——因为下游看到的只是"一段文字",并不知道这段文字其实已经丢失了关键数字或错位了表头。但解析同时也是一个明显的成本中心——能用简单文本抽取解决的事情,就不要轻易动用昂贵的视觉模型或重型 OCR 流水线,这只会拖慢整体节奏并消耗预算。
10
10
 
11
- ## The Minimum Viable Parser Principle
11
+ ## 最小可用解析器原则
12
12
 
13
- Start with the simplest parser. Escalate only when necessary. This is not about saving money — it is about producing the most reliable output. Simple parsers have fewer failure modes.
13
+ 从最简单的解析器开始。只有在确有必要时才向上升级。这条原则并不是单纯为了省钱——更重要的是为了产出最可靠、最稳定的结果。简单的解析器失败模式更少、更可控,排查起来也更直接;一旦堆叠了过多的处理层,出问题时几乎无法定位究竟是哪一层引入了偏差,调试成本会指数级上升。优先选择能用就用的最低层级,把复杂性留给真正需要的文档类型。
14
14
 
15
- ### Level 1: Direct Text Extraction
16
- - Tool: pdfjs-dist or similar PDF text extraction.
17
- - When: Well-formed digital PDFs with embedded text. This covers most modern business documents.
18
- - Output: Raw text with basic structure preserved (paragraphs, basic formatting).
19
- - Limitations: Tables may come out as messy text. Charts and images are invisible. Scanned PDFs produce nothing.
15
+ ### Level 1: 直接文本抽取
16
+ - 工具:pdfjs-dist,或其它类似的 PDF 文本抽取库。
17
+ - 适用场景:结构良好、内嵌文本的数字版 PDF。绝大多数现代商业文档(年报、招股说明书、合同、技术手册)都属于这一类。
18
+ - 输出:保留基本结构(段落、基础格式)的原始文本。
19
+ - 局限:表格往往会被拉平成杂乱的文字流,行列关系丢失。图表与图片完全不可见。扫描件 PDF 抽不出任何内容,加密 PDF 也会直接失败。
20
20
 
21
- ### Level 2: Provider VLM (Vision Language Model)
22
- - Tool: VLM models from configured provider (VLM_TIER3 for cheap OCR, VLM_TIER1 for complex interpretation).
23
- - When: Level 1 produces garbled/incomplete text, scanned PDFs, image-based PDFs.
24
- - Output: Recognized text from page images, or structured interpretation (table as markdown, chart data as JSON).
25
- - Calling a provider VLM is more convenient and reliable than deploying local OCR. Use the cheapest VLM tier first; escalate to a more capable tier for complex tables/charts.
21
+ ### Level 2: Provider VLM(视觉语言模型)
22
+ - 工具:已配置 provider 的 VLM 模型(VLM_TIER3 用于低成本 OCR,VLM_TIER1 用于复杂内容的语义解读)。
23
+ - 适用场景:Level 1 产出乱码或文本残缺、扫描件 PDF、图像型 PDF,以及版式高度复杂、列布局或注脚混乱的报告。
24
+ - 输出:从页面图像中识别出的文本,或更进一步的结构化解读结果(表格转成 markdown、图表数据转成 JSON)。
25
+ - 调用 provider VLM 通常比自行部署本地 OCR 更便捷、更稳定,也省去了维护模型权重、显卡资源与推理环境的整套开销。优先使用最便宜的 VLM 层级把内容先拿到手并完成基本评估;只有遇到复杂的表格、密集的图表,或更低层级在测试样本上确实不够用时,再升级到更强的层级,而不是一开始就用旗舰模型把所有页面跑一遍。
26
26
 
27
- ### Level 3: MineRU API or Local Tools (Optional)
28
- - Tool: MineRU API, pdfplumber, or locally deployed OCR — if configured.
29
- - When: Provider VLM is unavailable or too expensive for batch processing.
30
- - These are optional fallbacks. Most users will use Level 1 + Level 2.
27
+ ### Level 3: MineRU API 或本地工具(可选)
28
+ - 工具:MineRU API、pdfplumber,或本地部署的 OCR——前提是已经配置好。
29
+ - 适用场景:provider VLM 不可用,或批量处理时整体费用过高。
30
+ - 这些属于可选的兜底方案,并不是默认路径。大多数用户使用 Level 1 + Level 2 的组合就已经足够覆盖日常需求。
31
31
 
32
- ## Quality Detection
32
+ ## 质量检测
33
33
 
34
- How to know when to escalate:
34
+ 如何判断当前层级该不该向上升级:
35
35
 
36
- - **Low character count**: The document has pages but extracted text is very short. Likely a scanned PDF.
37
- - **Garbled text**: Unusual character sequences, encoding errors, or meaningless text patterns.
38
- - **Missing expected sections**: The table of contents mentions Chapter 5 but no Chapter 5 text was extracted.
39
- - **Table artifacts**: Columns of numbers without alignment, cell content mixed with headers, or table borders appearing as characters.
40
- - **Missing numbers in financial tables**: If a financial document's key metrics are not in the extracted text, the tables were probably not parsed.
36
+ - **字符数过低**:文档有多页,但抽取出的文本极短。大概率是扫描件 PDF,或者是带有大量图像内容的 PDF。
37
+ - **乱码文本**:出现异常的字符序列、明显的编码错误,或一眼看上去毫无意义的文字模式。
38
+ - **预期章节缺失**:目录中提到了第 5 章,但抽取结果里完全没有第 5 章的正文,这通常意味着排版方式让文本抽取器卡住了。
39
+ - **表格残骸**:成列的数字没有对齐、单元格内容与表头混在一起、表格边框被识别成无意义的字符流。
40
+ - **财务表格中关键数字缺失**:如果一份财务文档的关键指标没有出现在抽取文本里,那大概率就是表格没有被正确解析,而不是它们本身就不存在。
41
41
 
42
- Write a quick quality check after parsing and before proceeding. If quality is insufficient, escalate to the next parser level.
42
+ 在解析之后、继续后续处理之前,写一个快速的质量检查脚本。如果质量不达标,就升级到下一层解析器,而不是带病往下走。
43
43
 
44
- ### Parse Quality Score
44
+ ### 解析质量分数
45
45
 
46
- Compute a quality score (0.0 to 1.0) from weighted heuristics to make escalation decisions systematic rather than ad-hoc. A recommended starting framework:
46
+ 通过加权启发式规则计算一个 0.0 到 1.0 的质量分数,让升级决策有据可依,而不是单纯凭感觉做判断。一个推荐的初始框架如下:
47
47
 
48
- - **Character density** (weight ~0.3): actual character count / expected characters for the document's page count. A 10-page PDF that yields only 200 characters likely failed.
49
- - **Garble ratio** (weight ~0.2): fraction of characters that are common CJK/Latin vs control characters, unusual sequences, or encoding artifacts.
50
- - **Section completeness** (weight ~0.3): if the document has a table of contents, what fraction of TOC entries have matching content in the extracted text?
51
- - **Table integrity** (weight ~0.2): for financial documents, are key numeric values that should appear in tables actually present in the extracted text?
48
+ - **字符密度**(权重约 0.3):实际字符数 / 该文档页数下的预期字符数。一份 10 页的 PDF 如果只抽出 200 个字符,基本可以直接判定为失败。
49
+ - **乱码比例**(权重约 0.2):常见中英文字符占总字符的比例,相对于控制字符、异常序列或编码残留物的比例。
50
+ - **章节完整度**(权重约 0.3):如果文档自带目录,目录条目中有多少比例能在抽取文本里找到对应内容?
51
+ - **表格完整性**(权重约 0.2):对于财务类文档,本应出现在表格中的关键数值,是否真的出现在抽取文本中?
52
52
 
53
- **Escalation thresholds** (recommended defaults — adjust freely):
54
- - Score >= 0.7: accept this parser level, proceed to downstream processing.
55
- - Score 0.4-0.7: escalate to the next parser level, re-parse, re-score.
56
- - Score < 0.4: skip directly to Level 3 (OCR) or Level 4 (vision) depending on document characteristics.
53
+ **升级阈值**(推荐默认值——可根据实际情况自由调整):
54
+ - 分数 >= 0.7:接受当前解析器层级,进入下游处理。
55
+ - 分数 0.4-0.7:升级到下一层解析器,重新解析,重新打分。
56
+ - 分数 < 0.4:直接跳过中间层级,根据文档特征跳到 Level 3(OCR)或 Level 4(视觉模型)。
57
57
 
58
- **Lock-in**: once a parser level produces an acceptable score for a document type, record that level. Do not re-evaluate unless a downstream verification failure is traced back to a parsing issue.
58
+ **层级锁定**:一旦某个解析器层级在某类文档上跑出可接受的分数,就把这个层级记录下来。除非下游某次验证失败被追溯到解析环节,否则不要再花时间重复评估,这能避免在已经稳定的链路上反复折腾。
59
59
 
60
- These weights, thresholds, and the scoring approach itself are starting points. The coding agent should design whatever quality assessment works for the specific document types at hand — a simple pass/fail heuristic may be sufficient for some scenarios; a more nuanced scoring function may be needed for others. The important pattern is: **measure quality → compare to threshold → decide whether to escalate**.
60
+ 上述权重、阈值,乃至打分思路本身,都只是一个起点。编码 agent 应当为手头的具体文档类型设计真正合适的质量评估方案——某些场景下简单的通过/失败启发式就完全够用,另一些场景则可能需要更精细的分项打分函数。真正关键的模式是:**度量质量 → 与阈值比较 → 决定是否升级**,而不是某一组具体的权重数字。
61
61
 
62
- This follows the same tier-transition pattern as model tier selection in `skill-to-workflow`: a quality/accuracy score drives the decision to stay, escalate, or skip tiers.
62
+ 这与 `skill-to-workflow` 中模型层级选择遵循的层级跃迁模式是一致的:都由一个质量或准确率分数,来驱动"停留在当前层级、升级到下一层级、还是跳过中间层直接进入更高层"这三种决策。
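A minimal sketch of the measure, compare, decide pattern from this section, assuming the four heuristics have already been measured as ratios between 0 and 1. The function names, the `expected_chars_per_page` default of 1500, and the decision labels are illustrative assumptions and are not part of kc-beta; only the weights and thresholds come from the recommended defaults listed in the section.

```python
# Weights and thresholds mirror the recommended defaults; tune them per document type.
WEIGHTS = {"density": 0.3, "garble": 0.2, "sections": 0.3, "tables": 0.2}
ACCEPT_AT = 0.7
ESCALATE_AT = 0.4


def clamp01(x: float) -> float:
    """Clamp a ratio into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def quality_score(char_count, page_count, clean_ratio, toc_hit_ratio,
                  table_hit_ratio, expected_chars_per_page=1500):
    """Combine the four heuristics into one weighted score in [0, 1].

    expected_chars_per_page is a hypothetical default; calibrate it on real samples.
    """
    expected = page_count * expected_chars_per_page
    density = clamp01(char_count / expected) if expected > 0 else 0.0
    parts = {
        "density": density,
        "garble": clamp01(clean_ratio),      # share of characters that look like real text
        "sections": clamp01(toc_hit_ratio),  # share of TOC entries found in the text
        "tables": clamp01(table_hit_ratio),  # share of expected key values found
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


def decide(score: float) -> str:
    """Map a score to one of three actions: accept, escalate, or skip ahead."""
    if score >= ACCEPT_AT:
        return "accept"
    if score >= ESCALATE_AT:
        return "escalate"
    return "skip_ahead"
```

A caller would typically run `decide(quality_score(...))` after each parse attempt and record the accepted level per document type, so that the lock-in rule can skip re-evaluation later.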
63
63
 
64
- ## Table Handling
64
+ ## 表格处理
65
65
 
66
- Tables are critical in financial documents (balance sheets, ratio tables, compliance metrics). They deserve special attention:
66
+ 表格在财务文档中至关重要(资产负债表、比率表、合规指标表、监管披露表),它们承载了规则验证最常引用的数值,因此值得专门对待:
67
67
 
68
- 1. **Detection**: Identify table regions. Look for grid patterns, consistent column spacing, or explicit table markers.
69
- 2. **Extraction**: Extract cell-by-cell content. Preserve the row-column relationship.
70
- 3. **Reconstruction**: Convert to a structured format (markdown table, JSON array of rows, or CSV).
71
- 4. **Validation**: Spot-check that key values in the reconstructed table match what is visible in the document.
68
+ 1. **检测**:识别出页面中的表格区域。寻找规则的网格模式、稳定的列间距,或文档自身的显式表格标记。
69
+ 2. **抽取**:逐单元格提取内容。严格保留行列对应关系,不要把多列合并成一行长字符串。
70
+ 3. **重建**:把抽取结果转换成结构化格式(markdown 表格、JSON 行数组,或 CSV),便于下游程序消费。
71
+ 4. **校验**:对重建后的表格抽查若干关键单元格,确认其数值与文档中肉眼可见的值一致,以此尽早发现错位或漏行。
72
72
 
73
- When the standard parser fails on tables, try the vision model approach: send the table image (cropped from the PDF page) to a vision model and ask it to produce a markdown table.
73
+ 如果标准解析器在表格上失手,可以尝试视觉模型路线:把表格图像(从 PDF 页面上精确裁剪出来)发送给视觉模型,让它直接产出一个 markdown 表格。这种方式对带合并单元格、跨页延续的复杂表格尤其有效。
74
74
 
75
- ## Chart Handling
75
+ ## 图表处理
76
76
 
77
- Charts (bar charts, line charts, pie charts) occasionally contain data needed for verification:
77
+ 图表(柱状图、折线图、饼图、雷达图)偶尔也会承载验证所需的数据:
78
78
 
79
- - Extract the chart image from the document.
80
- - Send to a vision model with a prompt: "Extract the data points, labels, and values from this chart. Return as a JSON array."
81
- - Validate the extracted data against any nearby text or table that might contain the same numbers.
79
+ - 从文档中抽出图表图像,保留必要的坐标轴与图例区域。
80
+ - 发送给视觉模型,prompt 类似:"Extract the data points, labels, and values from this chart. Return as a JSON array."
81
+ - 用图表附近的文字段落或表格(它们经常会以另一种形式呈现相同的数字)对抽取结果进行交叉校验。
82
82
 
83
- This is expensive. Only do it when a verification rule specifically requires data from a chart and that data is not available in text elsewhere in the document.
83
+ 这一步代价相对昂贵,而且图表识别的准确率天然低于纯文本抽取。只有当某条验证规则明确要求使用图表中的数据,且该数据无法从文档其他文字或表格中获得时,才动用它,而不是默认对所有图表都跑一遍视觉模型。优先尝试在邻近文字中找到相同的数字,这通常是更经济也更可信的路径。
84
84
 
85
- ## Output Format
85
+ ## 输出格式
86
86
 
87
- Parsed documents should be saved as clean markdown:
87
+ 解析后的文档应当保存为干净、规范的 markdown:
88
88
 
89
- - Preserve the document's heading hierarchy (# Chapter, ## Section, ### Subsection).
90
- - Preserve lists, numbered or bulleted.
91
- - Convert tables to markdown table format.
92
- - Note page boundaries if relevant (some rules reference specific pages).
93
- - Strip noise: headers, footers, page numbers, watermarks (unless a rule specifically checks for them).
89
+ - 保留文档原有的标题层级(# 章、## 节、### 小节),不要把所有标题压平到同一层。
90
+ - 保留列表结构,无论是有序列表还是无序列表。
91
+ - 把表格转换成 markdown 表格格式,而不是用纯文字描述。
92
+ - 在有需要时标注页边界(部分规则会引用具体页码,这种位置信息必须保留)。
93
+ - 剔除噪声:页眉、页脚、页码、水印——除非某条规则专门要检查它们。
94
94
 
95
- Save parsed output alongside the original document for reuse across rules.
95
+ 把解析输出与原始文档保存在一起,方便跨规则、跨技能复用同一份解析结果。
96
96
 
97
- ## Caching
97
+ ## 缓存
98
98
 
99
- Parsing is expensive (especially Level 3-4). Cache parsed output:
100
- - Store the parsed markdown alongside the original file.
101
- - Track which parser level produced it.
102
- - Re-parse only when: the original file changes, a rule requires higher-quality parsing than what is cached, or a verification failure is traced back to a parsing issue.
99
+ 解析这一步本身很昂贵(尤其是 Level 3-4),在迭代规则、反复跑验证流程时尤其明显。务必把解析结果缓存下来,而不是每次都从原始 PDF 重新跑一遍:
100
+ - 把解析后的 markdown 与原始文件保存在同一目录,命名约定要稳定、可预测,便于程序化查找。
101
+ - 同时记录是哪一层解析器产出的,以及该层级当时拿到的质量分数,方便日后回溯与对比。
102
+ - 仅在以下情况重新解析:原始文件发生变化、某条规则要求比当前缓存更高质量的解析结果,或某次验证失败被追溯到了解析问题本身。
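One way to make the three re-parse conditions checkable is a small cache record stored next to the parsed markdown. The sketch below is an assumption about how that could look: the record fields, the function names, and the use of a SHA-256 digest to detect file changes are illustrative choices, not something the package defines.

```python
import hashlib
import json


def file_digest(data: bytes) -> str:
    """Fingerprint the original file so a later change can be detected."""
    return hashlib.sha256(data).hexdigest()


def make_cache_record(original: bytes, parser_level: int, score: float) -> dict:
    """Describe one cached parse: which file, which parser level, what quality."""
    return {
        "sha256": file_digest(original),
        "parser_level": parser_level,
        "quality_score": score,
    }


def needs_reparse(record, original: bytes, required_level: int = 1,
                  parse_blamed: bool = False) -> bool:
    """Return True only for the conditions listed above.

    record        -- cached record, or None when nothing is cached yet
    required_level -- minimum parser level the current rule asks for
    parse_blamed  -- True when a verification failure was traced to parsing
    """
    if record is None:
        return True
    if record["sha256"] != file_digest(original):
        return True  # the original file changed
    if record["parser_level"] < required_level:
        return True  # the rule needs a higher-quality parse
    return parse_blamed


# The record is plain JSON, so it can be written to a sidecar file by the caller.
example = json.dumps(make_cache_record(b"example bytes", 1, 0.82))
```

Keeping the record as plain JSON means the lookup stays programmatic and the stored quality score remains available for later comparison.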
@@ -1,40 +1,40 @@
1
- # Parser Catalog
1
+ # 解析器目录
2
2
 
3
- ## Text-Based Parsers (No LLM Required)
3
+ ## 文本类解析器(无需 LLM)
4
4
 
5
- | Parser | Type | Strengths | Limitations | Install |
5
+ | 解析器 | 类型 | 优势 | 局限 | 安装 |
6
6
  |--------|------|-----------|-------------|---------|
7
- | PyMuPDF (fitz) | Text extraction | Fast, reliable, basic structure | No table awareness, no OCR | `pip install pymupdf` |
8
- | pdfplumber | Layout-aware | Good table detection, spatial layout | Text-only, no OCR | `pip install pdfplumber` |
9
- | python-docx | DOCX parser | Native DOCX support, preserves structure | DOCX only | `pip install python-docx` |
10
- | openpyxl | XLSX parser | Full spreadsheet support | XLSX only | `pip install openpyxl` |
11
- | MarkItDown | Multi-format | Handles PDF, DOCX, PPTX, XLSX → markdown | Basic parsing, may miss complex layouts | `pip install markitdown` |
7
+ | PyMuPDF (fitz) | 文本抽取 | 快、稳定、基础结构识别 | 不识别表格、不支持 OCR | `pip install pymupdf` |
8
+ | pdfplumber | 版面感知 | 表格检测良好,保留空间布局 | 仅文本,不支持 OCR | `pip install pdfplumber` |
9
+ | python-docx | DOCX 解析器 | 原生支持 DOCX,保留结构 | 仅支持 DOCX | `pip install python-docx` |
10
+ | openpyxl | XLSX 解析器 | 完整支持电子表格 | 仅支持 XLSX | `pip install openpyxl` |
11
+ | MarkItDown | 多格式 | 处理 PDF、DOCX、PPTX、XLSX → markdown | 解析较基础,复杂版面可能丢失 | `pip install markitdown` |
12
12
 
13
- ## OCR / Vision Models (Via SiliconFlow API)
13
+ ## OCR / 视觉模型(通过 SiliconFlow API)
14
14
 
15
- | Model | Tier | Strengths | Best For |
15
+ | 模型 | 等级 | 优势 | 最适合 |
16
16
  |-------|------|-----------|----------|
17
- | zai-org/GLM-4.6V | OCR_TIER1 | Best accuracy, strong Chinese OCR | Complex tables, mixed layouts |
18
- | Qwen/Qwen3.5-397B-A17B | OCR_TIER2 | Good general vision, large model | Tables with context-dependent interpretation |
19
- | PaddlePaddle/PaddleOCR-VL-1.5 | OCR_TIER3 | Fast, lightweight | Standard text, simple tables |
17
+ | zai-org/GLM-4.6V | OCR_TIER1 | 准确率最高,中文 OCR 强 | 复杂表格、混合版面 |
18
+ | Qwen/Qwen3.5-397B-A17B | OCR_TIER2 | 通用视觉能力好,模型规模大 | 需要结合上下文理解的表格 |
19
+ | PaddlePaddle/PaddleOCR-VL-1.5 | OCR_TIER3 | 快、轻量 | 标准文本、简单表格 |
20
20
 
21
- ## Local Deployment Options
21
+ ## 本地部署选项
22
22
 
23
- For developer users who prefer local processing:
23
+ 适合偏好本地处理的开发者用户:
24
24
 
25
- | Tool | Type | Notes |
25
+ | 工具 | 类型 | 备注 |
26
26
  |------|------|-------|
27
- | PaddleOCR | Local OCR | Open source, supports Chinese/English |
28
- | Surya | Local OCR | Modern OCR with table detection |
29
- | pdf2md-local | PDF → Markdown | Reference: github.com/Ruilin-mmwa/pdf2md-local |
27
+ | PaddleOCR | 本地 OCR | 开源,支持中英文 |
28
+ | Surya | 本地 OCR | 现代 OCR,支持表格检测 |
29
+ | pdf2md-local | PDF → Markdown | 参考:github.com/Ruilin-mmwa/pdf2md-local |
30
30
 
31
- ## Selection Decision Tree
31
+ ## 选型决策树
32
32
 
33
33
  ```
34
- Is the PDF text-based (not scanned)?
35
- ├─ Yes → PyMuPDF or pdfplumber
36
- │ └─ Are tables parsed correctly?
37
- │ ├─ Yes → Done
38
- │ └─ No → Try pdfplumber → If still bad → Vision model on table regions
39
- └─ No (scanned) → OCR_TIER3 → If quality insufficient → OCR_TIER1
34
+ PDF 是文本型(非扫描件)吗?
35
+ ├─ 是 → PyMuPDF 或 pdfplumber
36
+ │ └─ 表格解析正确吗?
37
+ │ ├─ 是 → 完成
38
+ │ └─ 否 → 改用 pdfplumber → 仍不理想 → 对表格区域使用视觉模型
39
+ └─ 否(扫描件) → OCR_TIER3 → 质量不足 → OCR_TIER1
40
40
  ```
@@ -4,61 +4,59 @@ tier: meta
4
4
  description: Extract specific entities, values, and text segments from documents as required by verification rules. Use after tree processing has located the relevant section, when a rule needs a specific number, date, name, amount, clause, or any domain-specific entity extracted. Covers extraction method selection (regex vs LLM), schema design, postprocessing, and confidence annotation. Also use when designing the extraction step of a workflow for worker LLMs.
5
5
  ---
6
6
 
7
- # Entity Extraction
7
+ # 实体提取
8
8
 
9
- An entity is the thing you need to check. A number, a date, a name, a clause, a percentage, a statement. The rule says what to check; extraction is how you get the value to check it against.
9
+ 实体就是你需要核查的对象:一个数字、一个日期、一个名称、一个条款、一个百分比、一段陈述。规则告诉你要核查什么,提取负责把可核查的值从原文中取出来。换句话说,规则定义了"核查目标",而实体提取是把这个目标从纸面化为程序可比较、可判定的结构化数据的关键一步。没有可靠的提取,后续的判定和报告都是空中楼阁;提取阶段每多一分误差,下游的判定和汇总就会把这分误差放大。在金融与监管合规这类对数字、口径、时点极其敏感的场景里,提取的稳定性直接决定整套验证流程能否被信任。
10
10
 
11
- ## Extraction Type Taxonomy
11
+ ## 提取场景分类
12
12
 
13
- Different extraction scenarios call for different approaches:
13
+ 不同的提取场景需要不同的策略。先对照下面四类,识别当前规则属于哪一种,再去选具体的实现方法:
14
14
 
15
- ### Single Entity from Single Section
16
- The simplest case. One rule needs one value from one place.
17
- - Example: "Extract the capital adequacy ratio from the Key Metrics table."
18
- - Approach: Locate the section, apply regex or LLM extraction.
15
+ ### 单一章节中的单一实体
16
+ 最简单的情况。一条规则只需要从一个固定位置取一个值。
17
+ - 示例:"从关键指标表中提取资本充足率。"
18
+ - 思路:先通过树处理定位到对应章节,再用正则或 LLM 对该段文本做一次提取即可。
19
19
 
20
- ### Multiple Entities from Single Section
21
- One rule needs several related values from the same place.
22
- - Example: "Extract the borrower's name, loan amount, interest rate, and maturity date from the loan agreement summary."
23
- - Approach: Design a single extraction call that returns all values. More efficient than multiple calls.
20
+ ### 单一章节中的多个实体
21
+ 一条规则需要从同一段落或同一张表里取出多个相关的值。
22
+ - 示例:"从贷款协议摘要中提取借款人姓名、贷款金额、利率和到期日。"
23
+ - 思路:设计一次提取调用,让模型或脚本一次性返回所有字段。比拆成多次调用更高效,也更容易保持字段之间的一致性,避免重复加载相同的上下文。
24
24
 
25
- ### Single Entity from Multiple Sections
26
- One value is scattered across multiple places, or needs cross-referencing.
27
- - Example: "Extract the total collateral value, which may be listed in the collateral section or in Appendix A."
28
- - Approach: Collect content from all relevant sections, then extract. Note which source the value came from.
25
+ ### 多个章节中的单一实体
26
+ 同一个值分散在多个位置,或需要交叉比对、汇总。
27
+ - 示例:"提取抵押物总价值,该值可能列在抵押物章节,也可能列在附件 A。"
28
+ - 思路:先把所有可能含有该值的章节内容汇总,再做一次提取。务必在结果中标注该值的来源章节,方便后续追溯和审计。
29
29
 
30
- ### Entity from Full Document
31
- The value could be anywhere, or the rule applies to the document as a whole.
32
- - Example: "Check whether the document contains a valid signature page."
33
- - Approach: For the coding agent, scan the full document. For worker LLM workflows, design a two-pass approach: first pass identifies the location, second pass extracts the value.
30
+ ### 全文档级别的实体
31
+ 该值可能出现在任意位置,或者规则本身是针对整份文档的属性。
32
+ - 示例:"核查文档是否包含有效的签章页。"
33
+ - 思路:对于编码 agent,可以直接扫描整份文档。对于 worker LLM 工作流,建议设计两遍流程:第一遍粗扫整篇定位候选位置,第二遍只对候选段做精提取。这样能避免把整篇文档塞进单次调用导致上下文超限,也便于在候选阶段做并行化处理。
34
34
 
35
- ## Method Selection
35
+ ## 方法选择
36
36
 
37
- Extraction method selection is a cost-accuracy search. The goal is finding the cheapest method that meets the accuracy threshold. Regex is the smallest, cheapest "model" — zero cost, instant, deterministic. Worker LLM is more capable but costs tokens and time. Any search strategy is valid: try the cheapest first and escalate, try the most capable first and downgrade, bisect, or jump directly to a known-good method based on past experience in AGENT.md.
37
+ 提取方法的选择本质上是一次成本-准确率的搜索。目标是找到能稳定达到准确率阈值的最低成本方案。正则表达式是最小、最便宜的"模型"——零成本、即时、确定性、可重放、可被单元测试覆盖。Worker LLM 能力更强,覆盖语义层面的提取需求,但消耗 tokens 和时间,且每次输出可能存在细微差异,需要后处理与校验来兜底。任何搜索策略都成立:可以先试最便宜的方法再逐步升级,也可以先试最强的方法再逐步降级,可以在中间档位做二分查找,也可以基于 AGENT.md 中沉淀的历史经验直接跳到已验证可行的方法上,不必每次都从零开始重新试错。重要的是先想清楚"达标"的标准是什么,再开始搜索,否则容易在没有目标的情况下无限抬高成本。
38
38
 
39
- ### Available Methods
39
+ ### 可用方法
40
40
 
41
- **Regex / Python** - Cost: zero. Speed: instant. Deterministic.
42
- Works well for: dates, monetary amounts, percentages, identifiers, fixed phrases, any value with a predictable format.
41
+ **正则 / Python** —— 成本:零。速度:即时。结果确定。适用场景:日期、金额、百分比、标识符、固定短语、编号、电话、地址等任何格式可预测的值。任何能写出清晰格式约束的字段,都应该优先考虑正则。
43
42
 
44
- **Worker LLM** - Cost: API tokens. Speed: seconds. Semantic understanding.
45
- Works well for: contextual interpretation, conditional values, semantic matching, ambiguous structures, suggestive or misleading language detection, table interpretation, anything requiring understanding rather than pattern matching.
43
+ **Worker LLM** —— 成本:API tokens。速度:秒级。具备语义理解能力。适用场景:需要结合上下文判断、条件性取值、语义匹配、结构模糊、识别误导性或暗示性表述、表格语义解读,凡是依赖理解而非模式匹配的任务。Worker LLM 在表面形式不稳定但语义清晰的场景下尤其有价值。
46
44
 
47
- Many real verification tasks require semantic understanding — "is this description misleading?", "does this clause adequately disclose risk?", "is this guarantor's business description consistent with their stated industry?" — regex cannot handle these. Use worker LLM without hesitation for such tasks.
45
+ 实际验证任务中存在大量需要语义理解的场景——"这段描述是否具有误导性?"、"该条款是否充分披露风险?"、"该担保人的业务描述是否与其所述行业一致?"、"产品类型表述与底层资产是否匹配?"——这些都不是正则能处理的问题。遇到此类任务,毫不犹豫地使用 worker LLM;不要为了节省 tokens 而把不适合的任务硬塞给正则,否则就是用低成本换高漏报或高误报,最终在审计或复核环节付出更大代价。
48
46
 
49
- ### The Search
47
+ ### 搜索过程
50
48
 
51
- If a method's results fall below the accuracy threshold, try a different method or a more capable model. If regex works and meets accuracy — keep it, it's free. If regex produces results below threshold, escalate to worker LLM. If a cheap worker LLM isn't accurate enough, try a more capable tier. Record what works for each extraction type in AGENT.md for future reference.
49
+ 如果某个方法的结果低于准确率阈值,就换一种方法或换一档更强的模型。正则可行且达标——保留它,反正免费、稳定、可回放。正则结果不达标,升级到 worker LLM。便宜档的 worker LLM 不够准确,再换更高一档的模型。每个项目都应当把"哪类提取适用哪个方法"沉淀到 AGENT.md 里,作为后续同类规则的参考;这样下一条规则上来就能直接选对档位,而不是每次都从最便宜的方法重新搜索一遍。同时记录失败案例和阈值不达标的边界条件,让后续同类规则可以提前避坑、节省迭代成本。
52
50
 
53
- ## Project Glossary
51
+ ## 项目术语表
54
52
 
55
- The project glossary (built and maintained by `rule-extraction`, stored at `rules/glossary.json`) is a useful resource when designing extraction. It records canonical names and known aliases for entities that appear across rules. Reading it before extracting helps keep entity names schema-aligned and avoids parallel labels for the same thing.
53
+ 项目术语表(由 `rule-extraction` 构建并维护,存储在 `rules/glossary.json`)是设计提取时的有用资源。它记录了在多条规则中反复出现的实体的规范名称以及已知别名。在动手提取前先读一遍术语表,有助于保持实体命名与项目 schema 对齐,避免对同一事物使用并列的不同标签——例如同一个字段在不同规则里被叫作"资本充足率"、"资本充足比例"、"CAR"等,最终汇总报告时就会出现重复或漏匹配。
56
54
 
57
- Whether the glossary becomes more than a naming convention — for instance, driving cheap pattern matching for entities with stable surface forms — is a per-project judgment. Apply the same cost-accuracy logic as elsewhere: whatever method meets the accuracy threshold for the task at hand.
55
+ 术语表是否要承担命名约定之外的角色——例如,对表面形式稳定的实体直接驱动便宜的模式匹配——是逐项目判断的。在这里同样适用成本-准确率逻辑:在当前任务上能达到准确率阈值的方法就是合适的方法。如果术语表里某个实体的别名集合稳定且可枚举,那么基于术语表生成正则可能是性价比最高的方案;如果别名在新文档中持续扩展、随业务术语演化,那不如直接交给 worker LLM 做语义识别。术语表的另一个隐藏价值在于:它把命名约定固化成项目级单一事实来源,减少跨规则、跨技能之间因为称呼不同而产生的不一致问题。
58
56
 
59
- ## Schema Design
57
+ ## Schema 设计
60
58
 
61
- Define the expected output for each extraction. Keep it simple and JIT:
59
+ 为每次提取定义清晰的预期输出。保持简单、按需扩展(JIT 原则):
62
60
 
63
61
  ```json
64
62
  {
@@ -72,50 +70,62 @@ Define the expected output for each extraction. Keep it simple and JIT:
72
70
  }
73
71
  ```
74
72
 
75
- The schema should capture:
76
- - **value**: The extracted value, normalized.
77
- - **unit**: If applicable (%, 元, days, etc.).
78
- - **raw_text**: The original text fragment where the value was found. This is evidence for the judgment step.
79
- - **source_location**: Where in the document the value was found.
80
- - **confidence**: How sure you are (see `confidence-system`).
81
- - **extraction_method**: What extracted it (regex, LLM-TIER2, etc.).
73
+ Schema 通常需要包含以下信息:
74
+ - **value**:提取出的值,已经过归一化处理。
75
+ - **unit**:单位(如 %、元、天等),如果适用就填写。
76
+ - **raw_text**:值所在的原文片段。这是后续判定步骤的核心证据,也是出现争议时最容易回溯定位的字段。
77
+ - **source_location**:值在文档中的位置(章节号、表名、行列号等)。
78
+ - **confidence**:置信度,详见 `confidence-system`。
79
+ - **extraction_method**:使用的提取方法(regex、LLM-TIER2 等),便于事后做方法效果分析。
82
80
 
83
- Do not over-engineer the schema. Add fields as needed during testing.
81
+ 不要过度设计 schema。最开始保持最小集合,在测试中遇到判定需要什么信息再补充对应字段;不要在第一次就把可能用到的字段全部塞进去。冗余字段不仅增加 prompt 体量、增加 worker LLM 的失误面,还会让后续维护时不清楚哪些字段是真正被消费的、哪些是历史残留。schema 一旦写错或扩张得太快,回头清理的成本会很高。
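As an illustration of the six fields listed in this section, the record could be modelled as a small dataclass. This is a sketch under assumptions: the class name `Extraction`, the field types, and the `to_json` helper are hypothetical, and the sample values are made up for the example rather than taken from the elided JSON in the diff.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Union


@dataclass
class Extraction:
    """One extracted entity, carrying its evidence and provenance."""
    value: Union[float, str, None]   # normalized value, or None when missing
    raw_text: str                    # original fragment, kept verbatim as evidence
    source_location: str             # section, table name, row or column
    confidence: float                # see the confidence-system skill
    extraction_method: str           # e.g. "regex" or "LLM-TIER2"
    unit: Optional[str] = None       # only when the value has a unit

    def to_json(self) -> str:
        # ensure_ascii=False keeps Chinese evidence text readable in the output.
        return json.dumps(asdict(self), ensure_ascii=False)


# Made-up example values, for illustration only.
sample = Extraction(
    value=12.5,
    raw_text="资本充足率为12.5%",
    source_location="Key Metrics table",
    confidence=0.9,
    extraction_method="regex",
    unit="%",
)
```

Starting from a fixed, small record like this makes it obvious when a new field is being added, which fits the advice to extend the schema only when testing shows a need.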
84
82
 
85
- ## Postprocessing
83
+ ## 后处理
86
84
 
87
- Raw extracted values often need normalization:
85
+ 提取出的原始值通常需要归一化才能与规则中的阈值或目标值做严格比较:
88
86
 
89
- - **Chinese numerals → digits**: 一百二十万 → 1200000
90
- - **Date standardization**: 2024年3月15日 → 2024-03-15
91
- - **Unit conversion**: 万元 → multiply by 10000 if comparing to a threshold in 元.
92
- - **Whitespace and noise removal**: Strip extra spaces, line breaks, formatting artifacts.
93
- - **Percentage normalization**: 0.125 → 12.5% or vice versa, depending on what the rule expects.
87
+ - **中文数字 → 阿拉伯数字**:一百二十万 → 1200000
88
+ - **日期标准化**:2024年3月15日 → 2024-03-15
89
+ - **单位换算**:万元 → 若规则的阈值以元为单位,需要乘以 10000 再比较。
90
+ - **空白与噪声清理**:去除多余空格、换行符、转义符、表格分隔符等格式残留。
91
+ - **百分比归一化**:0.125 → 12.5%,或反向转换,取决于规则期望的形式。
94
92
 
95
- Build postprocessing as Python functions in the rule skill's `scripts/` directory. They are deterministic and reusable.
93
+ 把后处理实现为规则技能 `scripts/` 目录下的 Python 函数。它们是确定性的、可复用的,且便于单元测试。提取与后处理分离也让 schema 中的 `raw_text` 保持忠实于原文,归一化后的值放进 `value`,两者各司其职。这种分层好处还在于:当后处理逻辑出现 bug 时,只要 `raw_text` 是对的,就可以重跑归一化而不必重新调用 LLM、节省成本。
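The normalizations listed in this section can be sketched as small deterministic functions. The code below is a minimal illustration, assuming well-formed input: the function names are hypothetical, the numeral converter only handles integers built from the common characters up to 亿, and the unit table covers just 元, 万元, and 亿元.

```python
import re
from datetime import date

DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10_000, "亿": 100_000_000}
YUAN_FACTORS = {"元": 1, "万元": 10_000, "亿元": 100_000_000}
DATE_RE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def chinese_to_int(text: str) -> int:
    """Convert a Chinese integer numeral such as 一百二十万 to an int."""
    total = 0     # value of completed 万/亿 groups
    section = 0   # value accumulated below the next big unit
    digit = 0     # most recent digit not yet attached to a unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (digit or 1) * SMALL_UNITS[ch]  # a bare 十 means 10
            digit = 0
        elif ch in BIG_UNITS:
            section += digit
            if ch == "亿":
                total = (total + section) * BIG_UNITS[ch]
            else:
                total += section * BIG_UNITS[ch]
            section, digit = 0, 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def normalize_date(text: str) -> str:
    """Turn a date like 2024年3月15日 into ISO form; invalid dates raise ValueError."""
    match = DATE_RE.search(text)
    if match is None:
        raise ValueError(f"no date found in {text!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day).isoformat()


def to_yuan(amount: float, unit: str) -> float:
    """Scale an amount to 元 so it can be compared with a threshold in 元."""
    return amount * YUAN_FACTORS[unit]
```

Because these functions take the verbatim `raw_text` as input and return the normalized `value`, a bug in normalization can be fixed and rerun without calling the LLM again.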
96
94
 
97
- ## Confidence Annotation
95
+ ## 置信度标注
98
96
 
99
- Every extraction should carry a confidence estimate:
97
+ 每次提取都应当带上一个置信度估计,作为后续判定与汇报阶段的重要输入:
100
98
 
101
- - **Regex match, validated format**: 0.90-0.95
102
- - **LLM extraction, high certainty**: 0.80-0.85
103
- - **LLM extraction, some ambiguity**: 0.60-0.75
104
- - **Fallback or inferred value**: 0.40-0.60
105
- - **No value found**: 0.0 (flag as MISSING)
99
+ - **正则匹配,格式校验通过**:0.90-0.95
100
+ - **LLM 提取,高度确定**:0.80-0.85
101
+ - **LLM 提取,存在一定歧义**:0.60-0.75
102
+ - **回退或推断得到的值**:0.40-0.60
103
+ - **未找到值**:0.0(标记为 MISSING)
106
104
 
107
- These are starting points. Calibrate based on actual accuracy (see `confidence-system`).
105
+ 以上只是起始值。随着 ground truth 累积,应根据实际准确率持续校准(详见 `confidence-system`)。低置信度的提取应在判定阶段被特别对待,例如触发人工复核或交叉比对,而不是直接当作高置信度结果使用。置信度本身不是装饰字段,而是判定阶段做风险加权的依据;如果整套流程对置信度毫无消费,那这个字段就形同虚设,反而会让团队对系统输出产生虚假的"完整感"。
108
106
 
109
- ## Prompt Design: Ask For What You Want
107
+ ## Prompt 设计:要什么,说什么
110
108
 
111
- Design prompts for what you want, not against what you don't want. "Don't include explanations" in a prompt is less reliable than stripping non-JSON text from the output in postprocessing. If you need to tell the LLM not to do something, use output filtering instead of prompt negation.
109
+ 写 prompt 时要直接描述你想要的输出形态,而不是反复强调你不想要的内容。在 prompt 里写"不要包含解释"或"不要输出额外文本",远不如在后处理时从输出中剥离非 JSON 文本来得可靠——大模型在压力下经常会"为了帮助你"补充一些自以为有用的说明、致歉、或开场白,从而违反否定指令。如果确实必须告诉 LLM 不要做某事,那就把控制点放在后处理的输出过滤上,而不是 prompt 中的否定句。换言之:用确定性的后处理来兜底不确定的 LLM 行为,永远比单靠 prompt 措辞更稳。同理,"必须返回合法 JSON"也应当配套一段健壮的解析与修复逻辑,而不是天真地假设模型每次都能完美输出。
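A sketch of the output-filtering idea: instead of telling the model not to add explanations, scan its reply for the first parseable JSON object and discard everything else. The function name `extract_json_object` is hypothetical; the only library call it relies on is `json.JSONDecoder.raw_decode` from the Python standard library.

```python
import json


def extract_json_object(raw: str) -> dict:
    """Return the first JSON object embedded in a model reply.

    Surrounding chatter (greetings, apologies, code-fence markers) is ignored.
    Raises ValueError when no object can be decoded.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(raw):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start`
            # and tolerates trailing text after it.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            continue  # this brace did not open valid JSON; keep scanning
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in model output")


reply = 'Sure! Here is the result: {"value": 12.5, "unit": "%"} Hope this helps.'
parsed = extract_json_object(reply)
```

This only strips surrounding text; it does not repair truncated or malformed JSON, so a failed parse should still be treated as a failed extraction and retried or escalated.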
112
110
 
113
- ## Fitting Worker LLM Context
111
+ ## Worker LLM 上下文适配
114
112
 
115
- When designing extraction for worker LLM workflows:
113
+ 为 worker LLM 工作流设计提取时,需要预先估算并约束上下文规模:
116
114
 
117
- 1. Calculate the prompt size: system prompt + instructions + examples + output format = N tokens.
118
- 2. Available context for document content = model's context window - N.
119
- 3. If the section exceeds available context, narrow further via tree processing.
120
- 4. Always leave room for the model's response.
121
- 5. Test with the actual model to verify the context fits — token counts from the coding agent may differ from the worker LLM's tokenizer.
115
+ 1. 估算 prompt 体量:系统 prompt + 指令 + 示例 + 输出格式 = N tokens
116
+ 2. 留给文档内容的可用上下文 = 模型上下文窗口 - N
117
+ 3. 若目标章节超出可用上下文,回到树处理进一步收窄,或将其切分为多次调用。
118
+ 4. 始终为模型的响应预留足够空间,否则可能在生成中途被截断,导致 JSON 不完整。
119
+ 5. 用真实使用的模型做端到端测试以验证上下文确实能装下——编码 agent 估算的 token 数可能与 worker LLM 自己的分词器结果不一致,尤其是在中文、表格与代码混排的场景下,差异可能达到数十个百分点,仅凭估算容易在生产环境上线后才发现窗口被打爆。
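The budgeting steps above reduce to simple arithmetic. This is a rough sketch under stated assumptions: the function names are hypothetical, and the `safety_margin_pct` parameter is an addition of this example meant to absorb tokenizer mismatch, not something the skill text prescribes.

```python
def document_budget(context_window: int, prompt_tokens: int,
                    response_reserve: int, safety_margin_pct: int = 10) -> int:
    """Tokens left for document content after prompt and response are reserved.

    safety_margin_pct is an assumed percentage of the window held back
    because token estimates may differ from the worker LLM's tokenizer.
    """
    usable = context_window * (100 - safety_margin_pct) // 100
    return max(usable - prompt_tokens - response_reserve, 0)


def calls_needed(section_tokens: int, budget: int) -> int:
    """Number of calls required if the section is split to fit the budget."""
    if budget <= 0:
        raise ValueError("prompt and response reserve leave no room for content")
    return -(-section_tokens // budget)  # ceiling division


budget = document_budget(context_window=8000, prompt_tokens=1500,
                         response_reserve=1000)
print(budget, calls_needed(10000, budget))
```

With these illustrative numbers the budget is 4,700 tokens and a 10,000-token section needs three calls. The estimate is only a planning aid; the end-to-end test with the real model remains the check that counts.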
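+
+ ## Extraction Has Corner Cases Too
+
+ Extraction **matters as much as judgment**, and its contribution to final accuracy should not be underestimated. A cross-project observation: more than half of final errors can be traced back to extraction problems rather than judgment problems. The extractor returned the wrong value, the wrong unit, or pulled content from the wrong section, and the judge faithfully derived a wrong verdict from the wrong input.
+
+ Apply the same iteration discipline to extraction as to judgment:
+
+ - **Reflect / iterate**: After running the extractor once over the sample set, look back at the failed cases. Was a pattern missed (add it to the prompt or regex)? Is it a format quirk (unit conversion, localization)? Or is it a document-type problem (the extractor is right on type A and wrong on type B)?
+ - **Corner-case registration**: When an extraction failure cannot be fixed in the standard extractor at reasonable cost, register it in `corner-case-management`. The registry has the same shape as for judgment corner cases; just switch the resolution type to a `code` / `prompt` / `parser`-level transformation.
+ - **Validate the extractor independently**: End-to-end tests that only register failures on the judgment side can hide a poor extractor, one whose output is inaccurate but happens to lead the judge to the correct verdict *most of the time*. QC review should spot-check the extracted values themselves, not just the final verdict.
+
+ When you want to raise accuracy by tuning the judge's prompt, first check whether the extractor is feeding the judge the correct input. The cheaper, more durable fix is almost always in the extractor.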
@@ -1,62 +1,62 @@
- # Convergence Guide
+ # Convergence Diagnostics Guide
 
- Diagnostic procedures and real-world data for understanding when the evolution loop is converging, stalling, or regressing.
+ Diagnostic procedures and real-world data for judging whether the evolution loop is converging, stalling, or regressing.
 
- ## Empirical Data
+ ## Empirical Data
 
- ### The Shiji Project — Event Dating Reflection
+ ### The Shiji Project: Historical Event Dating Review
 
- A document verification project for historical event dating across regulatory filings. Five rounds of evolution:
+ A project that verifies historical event dates in regulatory filing documents. It went through five rounds of evolution:
 
- - **Round 1**: 1,010 corrections (first pass — many extraction and judgment errors across the board).
- - **Round 2**: 431 corrections (systematic fixes applied — regex patterns, prompt refinements).
- - **Round 3**: 465 corrections (regression — round 2 fix for date normalization introduced new failures on edge-case date formats).
- - **Round 4**: 167 corrections (stabilizing — round 3 regression diagnosed and resolved, remaining issues are corner cases).
- - **Round 5**: 46 corrections (converged — below 5% threshold, no new patterns, no regressions).
+ - **Round 1**: 1,010 corrections (first round; many errors in both extraction and judgment).
+ - **Round 2**: 431 corrections (systematic fixes: adjusted regex patterns, rewrote prompts).
+ - **Round 3**: 465 corrections (regression: the round 2 date-normalization fix caused new failures on edge-case date formats).
+ - **Round 4**: 167 corrections (stabilizing: the round 3 regression was diagnosed and resolved; the remaining issues are long-tail corner cases).
+ - **Round 5**: 46 corrections (converged: below the 5% threshold, no new patterns, no regressions).
 
- **Key insight**: The round 3 spike was the most informative event. It revealed that the round 2 fix was too aggressive — it normalized dates that should not have been normalized. Without convergence tracking, this regression might have been masked by overall accuracy still improving on other cases.
+ **Key insight**: The round 3 rebound was the most valuable event. It revealed that the round 2 fix was too aggressive: it force-normalized dates that should not have been normalized. Without convergence tracking, this regression could have been masked by the overall accuracy improvement.
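Tracking the per-round correction counts is enough to surface a rebound like round 3. A minimal sketch using the figures reported above; the function name `flag_rebounds` is hypothetical, and a rising count is only a prompt to investigate, since test-set changes can also raise it.

```python
# Correction counts for rounds 1 to 5, as reported for the Shiji project.
SHIJI_CORRECTIONS = [1010, 431, 465, 167, 46]


def flag_rebounds(counts):
    """Return 1-based round numbers whose count rose versus the prior round."""
    return [
        index + 1
        for index in range(1, len(counts))
        if counts[index] > counts[index - 1]
    ]


print(flag_rebounds(SHIJI_CORRECTIONS))
```

On the reported data this flags round 3 only, which matches the narrative: 465 corrections after 431.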
 
- ## Diagnostic Flowchart
+ ## Diagnostic Flowchart
 
- ### If correction volume increases between iterations:
+ ### If the correction count increases between iterations:
 
- 1. **Check for regression**: Are previously passing cases now failing? If yes, the last fix is the likely cause. Compare the diff between iterations.
- 2. **Check for fix conflicts**: Does the new fix contradict a prior fix? For example, broadening a regex in round N that was narrowed in round N-1.
- 3. **Check for test set changes**: Did new documents enter the test set between iterations? New documents can inflate correction volume without indicating regression.
+ 1. **Check for regression**: Are cases that previously passed now failing? If so, the last fix is the likely culprit. Compare the diff between the two iterations.
+ 2. **Check for conflicting fixes**: Does the new fix contradict an earlier one? For example, round N loosens a regex that round N-1 tightened.
+ 3. **Check whether the test set changed**: Were new documents added between the two iterations? New documents can raise the correction count without indicating a regression.
 
- ### If correction volume stays flat (not decreasing):
+ ### If the correction count stays flat (not decreasing):
 
- 1. **Check for oscillation**: Are the same cases flipping between pass and fail across iterations? This indicates the fix is unstable — it solves one variant but breaks another.
- 2. **Check if fix is too narrow**: The fix addresses the specific failing cases but does not generalize. The next iteration reveals similar cases the fix missed.
+ 1. **Check for oscillation**: Are the same cases flipping back and forth between pass and fail across iterations? That means the fix is unstable: it solves one variant but breaks another.
+ 2. **Check whether the fix is too narrow**: The fix only hits the specific failing cases and does not generalize. The next iteration will expose similar cases the fix did not cover.
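Several of the checks above can be automated if each iteration stores a per-case pass/fail map. The sketch below assumes such a map (`case_id -> bool`); that storage format and the function names are assumptions of this example, not something the guide specifies.

```python
def regressed_cases(previous, current):
    """Cases that passed in the previous iteration and fail in the current one."""
    return sorted(
        case for case, passed in previous.items()
        if passed and current.get(case) is False
    )


def new_cases(previous, current):
    """Cases present only in the current iteration (the test set changed)."""
    return sorted(set(current) - set(previous))


def oscillating_cases(history, min_flips=2):
    """Cases whose result flipped at least min_flips times across iterations."""
    flips = {}
    for before, after in zip(history, history[1:]):
        for case, passed in after.items():
            if case in before and before[case] != passed:
                flips[case] = flips.get(case, 0) + 1
    return sorted(case for case, count in flips.items() if count >= min_flips)
```

Separating `new_cases` from `regressed_cases` keeps a growing test set from being mistaken for a regression, and `oscillating_cases` points at fixes that trade one variant for another.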
 
- ## False Convergence
+ ## False Convergence
 
- Metrics look stable but underlying issues are masked. The system appears converged but will fail on production data.
+ The metrics look stable, but underlying problems are masked. The system appears to have converged, yet it breaks as soon as it meets production data.
 
- ### Common Causes
+ ### Common Causes
 
- - **Test set too small**: With fewer than 20 test cases, a single case changing can swing metrics by 5%. Convergence at this scale is statistically meaningless.
- - **Test set does not cover production variety**: The test set was curated from "clean" examples. Production documents include scanned PDFs, handwritten annotations, multi-language content, and formatting variations the test set never saw.
- - **Corner cases excluded from metrics**: If difficult cases are moved to `corner_cases.json` and excluded from accuracy calculation, the remaining "easy" cases converge quickly but the real problem is hidden.
+ - **Test set too small**: With fewer than 20 test cases, a single case changing state can move the metric by 5% or more. Convergence at this scale is statistically meaningless.
+ - **Test set does not cover production variety**: The test set was curated from "clean" samples, whereas production documents include scanned PDFs, handwritten annotations, multilingual content, and assorted format variants, none of which appear in the test set.
+ - **Hard cases excluded from metrics**: If hard cases are moved into `corner_cases.json` and excluded from the accuracy calculation, the remaining "easy" cases converge quickly, but the real problem is hidden.
 
- ### Detection
+ ### Detection
 
- Compare test set distribution to production distribution on key dimensions: document type, length, format, source. If they diverge significantly, convergence on the test set does not guarantee production quality.
+ Compare the distributions of the test set and production data along key dimensions such as document type, length, format, and source. If the two diverge significantly, convergence on the test set does not guarantee production quality.
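One way to put a number on "diverge significantly" for a categorical dimension such as document type is total variation distance. This metric is a choice made for this sketch, not one named by the guide, and what counts as a significant value would have to be decided per project.

```python
from collections import Counter


def distribution(labels):
    """Relative frequency of each label in a non-empty list."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(test_labels, production_labels):
    """Total variation distance: 0.0 for identical mixes, 1.0 for disjoint ones."""
    test = distribution(test_labels)
    production = distribution(production_labels)
    labels = set(test) | set(production)
    return 0.5 * sum(
        abs(test.get(label, 0.0) - production.get(label, 0.0))
        for label in labels
    )
```

For instance, a test set of only digital PDFs compared against production traffic that is half scanned documents gives a distance of 0.5, a strong hint that convergence on the test set says little about production.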
 
- ## Estimating Remaining Rounds
+ ## Estimating Remaining Rounds
 
- ### Simple Heuristic
+ ### Simple Heuristic
 
- If corrections approximately halve each round, expect `log2(current_corrections / threshold)` more rounds.
+ If corrections roughly halve each round, expect about `log2(current_corrections / threshold)` more rounds.
 
- Example: current round has 200 corrections, threshold is 5% of 1000 cases = 50 corrections.
- - Estimated remaining rounds: log2(200/50) = log2(4) = 2 rounds.
+ Example: the current round has 200 corrections, and the threshold is 5% of 1000 cases = 50 corrections.
+ - Estimated remaining rounds: log2(200/50) = log2(4) = 2 rounds.
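The heuristic is a one-line calculation. In this sketch the function name is hypothetical, and rounding up to a whole number of rounds is an assumption added here, since the guide's example happens to land on an integer.

```python
import math


def remaining_rounds(current_corrections: int, threshold: int) -> int:
    """Estimated rounds left, assuming corrections halve every round."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if current_corrections <= threshold:
        return 0  # already at or below the convergence threshold
    return math.ceil(math.log2(current_corrections / threshold))


print(remaining_rounds(200, 50))
```

For the worked example (200 corrections, threshold 50) this prints 2, matching the hand calculation.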
 
- ### When the Heuristic Fails
+ ### When the Heuristic Fails
 
- If corrections do not halve between rounds, the current approach may have hit its ceiling. Consider:
- - Escalating the fix strategy (prompt tweak → logic rewrite → architecture change).
- - Expanding the test set to reveal hidden patterns.
- - Consulting the developer user for domain insight on stubborn failures.
+ If corrections are not halving each round, the current approach may have hit its ceiling. Consider:
+ - Escalating the fix strategy (prompt tweak → logic rewrite → architecture change).
+ - Expanding the test set to expose hidden patterns.
+ - Consulting the developer user for domain insight on stubborn failure cases.
 
- Do not grind through more iterations expecting different results. If three consecutive rounds show similar correction volumes, stop and reassess.
+ Do not count on grinding through more rounds to get a different result. If three consecutive rounds show similar correction counts, stop and reassess.
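The stop rule can be checked mechanically once "similar" is given a concrete meaning. The guide does not define it, so the 10% tolerance below is purely an assumption of this sketch, as is the function name.

```python
def has_plateaued(counts, window=3, tolerance=0.10):
    """True if the last `window` rounds differ by at most `tolerance` (relative).

    tolerance is an assumed definition of "similar": the spread between the
    highest and lowest count in the window, divided by the highest count.
    """
    if len(counts) < window:
        return False
    recent = counts[-window:]
    highest, lowest = max(recent), min(recent)
    if highest == 0:
        return True  # nothing left to correct, trivially flat
    return (highest - lowest) / highest <= tolerance
```

A series such as 400, 210, 200, 205 would trigger the rule, while the Shiji counts above would not, because the last three rounds keep dropping sharply.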
@@ -12,6 +12,20 @@ description: Design and execute quality control for production verification work
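 
  The role of quality monitoring is that of an "observer": maintain confidence in the system's accuracy with the least amount of re-checking. When confidence drops, raise the alarm immediately and trigger the evolution loop.
 
+ ## Collaboration with Other Skills
+
+ Quality monitoring is one member of a tightly collaborating set of skills. Do not copy a sibling skill's content here and restate it; just reference it. Skills loaded together in the same phase are already visible to the conductor, so injecting their material again in this skill only bloats both sides.
+
+ How they relate:
+
+ - `confidence-system` defines how confidence is composed and calibrated. When QC uses confidence to triage "which results need more review", it **consumes** confidence, but the design of confidence belongs over there.
+ - `evolution-loop` is the closed-loop machinery that turns QC findings into improvements. QC produces signals (failures, drift, recurring patterns); evolution-loop decides what to do with them.
+ - `corner-case-management` is where anomalies found by QC end up. QC reveals "this one did not reconcile"; corner-case-management decides whether it should be registered as a corner case, is a systemic problem that should be promoted into the main flow, or is a data-quality problem that needs escalation.
+ - `cross-document-verification` is a different class of rules. QC's job is to check that those rules are executing as designed, not to explain again how to build them.
+ - `dashboard-reporting` is where QC results are presented to the developer user. QC produces the data; the dashboard renders it.
+
+ What this means for writing: if you find yourself writing something in this file that belongs more naturally to one of the skills above, leave a one-line pointer here ("for confidence composition, see `confidence-system`") and keep the depth where it belongs. When the conductor needs the details, that skill is already loaded for it.
+
  ## Five-Layer Quality Assurance Architecture
 
  Quality control is not a single activity. It consists of five layers that build on one another; a higher layer runs only after the lower layers have passed.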
@@ -248,6 +262,15 @@ logs/qc/
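 
  After publishing a release, direct end users to the dashboard inside the release package, not the one in the workspace. The workspace dashboard is your own developer view.
 
+ ## Re-release Required After Substantive Changes
+
+ A release package is a snapshot of `workflows/` and `rule_skills/` at a point in time. If any `workflows/<rule>/workflow_v*.py`, `rule_skills/<id>/SKILL.md`, or `check.py` is modified after the release is built, the published artifact no longer reflects your actual work. The engine's milestone derivation will flag `releaseIsStale: true` and list the files that differ.
+
+ Once this is triggered:
+ - **Substantive changes** (a new hybrid path, corrected judgment logic, a new rule): rerun the `release` tool to generate a new package.
+ - **Cosmetic edits only** (typos, comments, formatting): write `.accept_stale_release` into the release directory to acknowledge it: `touch output/releases/<slug>/.accept_stale_release`.
+ - **Do not** declare finalization complete while the release is stale. Downstream consumers (other agents, deployed verification systems) read the `parser_v*.py` / `workflows/` inside the release package, not the workspace.
+
  ## Developer User Involvement
 
  Quality monitoring should not make the developer user read JSON files. Generate visual reports through the dashboard skill; the developer user only needs to focus on: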