npm - kc-beta - Versions diffs - 0.2.1 → 0.3.0 - Mend

kc-beta 0.2.1 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (41)

package/package.json +1 -1
package/src/agent/context.js +8 -4
package/src/agent/engine.js +65 -9
package/src/agent/pipelines/initializer.js +53 -8
package/src/agent/session-state.js +1 -0
package/src/agent/skill-loader.js +13 -1
package/src/agent/tools/document-parse.js +104 -21
package/src/agent/tools/document-search.js +24 -8
package/src/agent/tools/sandbox-exec.js +16 -5
package/src/agent/tools/workspace-file.js +47 -20
package/src/agent/workspace.js +24 -1
package/src/cli/components.js +8 -1
package/src/cli/config.js +100 -6
package/src/cli/index.js +14 -1
package/src/cli/onboard.js +70 -1
package/src/config.js +43 -3
package/src/model-tiers.json +153 -0
package/src/providers.js +63 -66
package/template/AGENT.md +20 -0
package/template/skills/en/meta/compliance-judgment/SKILL.md +10 -42
package/template/skills/en/meta/document-chunking/SKILL.md +32 -0
package/template/skills/en/meta/document-parsing/SKILL.md +11 -18
package/template/skills/en/meta/entity-extraction/SKILL.md +13 -28
package/template/skills/en/meta/tree-processing/SKILL.md +19 -1
package/template/skills/en/meta-meta/auto-model-selection/SKILL.md +53 -0
package/template/skills/en/meta-meta/pdf-review-dashboard/SKILL.md +57 -0
package/template/skills/en/meta-meta/pdf-review-dashboard/scripts/generate_review.js +262 -0
package/template/skills/en/meta-meta/rule-extraction/SKILL.md +24 -1
package/template/skills/en/meta-meta/skill-authoring/SKILL.md +6 -0
package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +4 -0
package/template/skills/zh/meta/compliance-judgment/SKILL.md +41 -262
package/template/skills/zh/meta/document-chunking/SKILL.md +32 -0
package/template/skills/zh/meta/document-parsing/SKILL.md +65 -132
package/template/skills/zh/meta/entity-extraction/SKILL.md +68 -230
package/template/skills/zh/meta/tree-processing/SKILL.md +82 -194
package/template/skills/zh/meta-meta/auto-model-selection/SKILL.md +51 -0
package/template/skills/zh/meta-meta/pdf-review-dashboard/SKILL.md +55 -0
package/template/skills/zh/meta-meta/pdf-review-dashboard/scripts/generate_review.js +262 -0
package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +79 -164
package/template/skills/zh/meta-meta/skill-authoring/SKILL.md +64 -185
package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +95 -216

package/template/skills/zh/meta/compliance-judgment/SKILL.md CHANGED

@@ -3,188 +3,33 @@ name: compliance-judgment
 description: Determine whether extracted entities comply with verification rules. Use after entity extraction to make the pass/fail judgment for each rule on each document. Covers translating natural language rules into executable logic, choosing between Python calculation and LLM semantic judgment, and producing actionable comments on failures. Also use when designing the judgment step of a workflow or when a rule's judgment logic needs debugging.
 ---
-# 合规判定
+# Compliance Judgment
-判定是核查流程的关键时刻。你已经提取到了实体值，你手里有规则要求。现在要回答一个问题：**合规还是不合规？**
+Judgment is the moment of truth. You have the extracted entity. You have the rule. Do they comply? The answer must be clear, correct, and — when the answer is no — accompanied by a concise, actionable comment.
-答案必须清晰、正确，并且在不合规时附带简洁、可操作的评论。
+## The Judgment Spectrum
-## 判定类型谱
+Rules range from trivially deterministic to deeply semantic. Pick the right tool for each rule.
-规则落在一个谱系上，从完全确定性到完全语义化：
+**Deterministic** — threshold checks, format validation, date arithmetic, cross-field consistency. Pure Python: free, instant, deterministic.
-```
-确定性判定（Python）◄───────────────────────►语义判定（LLM）
-阈值检查  格式验证  日期计算  交叉一致  充分性  完整性  一致性  模板合规
-```
-左侧的判定用代码解决——免费、即时、确定。右侧的判定需要语言理解——需要 LLM、有成本、有不确定性。大多数规则处于中间位置，需要混合方法。
-## 确定性判定（用 Python）
-规则有明确、可计算的标准时，用 Python 实现判定逻辑。
-### 阈值检查
-金融监管中最常见的判定类型。
-```python
-# 资本充足率 ≥ 8%（银保监会要求）
-result = "pass" if extracted_ratio >= 8.0 else "fail"
-comment = f"资本充足率为{extracted_ratio}%，低于监管最低要求8.0%" if result == "fail" else ""
-# 不良贷款率（通常监控 < 5%，但阈值因机构类型而异）
-result = "pass" if npl_ratio < threshold else "fail"
-# 拨备覆盖率 ≥ 150%
-result = "pass" if provision_coverage >= 150.0 else "fail"
-# 单一客户贷款集中度 ≤ 10%
-result = "pass" if single_exposure <= 10.0 else "fail"
-```
-注意事项：
-- 边界值的处理要明确：`>=` 还是 `>`？监管文件中"不低于"对应 `>=`，"低于"对应 `<`。
-- 浮点精度：用 `Decimal` 或设定合理的容差（如 0.01%）。金融数据通常精确到小数点后两位。
-### 格式验证
-```python
-import re
-# 贷款编号格式：XX-YYYY-ZZZZZZ
-result = "pass" if re.match(r"[A-Z]{2}-\d{4}-\d{6}", loan_number) else "fail"
-# 统一社会信用代码：18位
-result = "pass" if re.match(r"^[0-9A-Z]{18}$", uscc) else "fail"
-# 手机号格式
-result = "pass" if re.match(r"^1[3-9]\d{9}$", phone) else "fail"
-```
-### 日期计算
-```python
-from datetime import datetime, timedelta
-# 合同签署日期在申请日期30天内
-sign_date = datetime.strptime(extracted_sign_date, "%Y-%m-%d")
-app_date = datetime.strptime(extracted_app_date, "%Y-%m-%d")
-result = "pass" if (sign_date - app_date).days <= 30 else "fail"
-comment = f"签署日期{extracted_sign_date}距申请日期{extracted_app_date}为{(sign_date - app_date).days}天，超过30天限制" if result == "fail" else ""
-# 贷款到期日不早于合同约定
-result = "pass" if actual_maturity >= contracted_maturity else "fail"
-# 报告出具日期在报告期末后4个月内（年报要求）
-report_date = datetime.strptime(extracted_report_date, "%Y-%m-%d")
-period_end = datetime.strptime(extracted_period_end, "%Y-%m-%d")
-deadline = period_end + timedelta(days=120)  # 约4个月
-result = "pass" if report_date <= deadline else "fail"
-```
-### 交叉一致性检查
-```python
-# 合计数等于明细之和
-result = "pass" if abs(total - sum(items)) < 0.01 else "fail"
-comment = f"合计数{total}与明细之和{sum(items)}不一致，差额{total - sum(items)}" if result == "fail" else ""
-# 资产负债表平衡：资产 = 负债 + 所有者权益
-result = "pass" if abs(assets - liabilities - equity) < 0.01 else "fail"
-# 同一指标在不同章节的值一致
-result = "pass" if value_in_summary == value_in_detail else "fail"
-comment = f"摘要中为{value_in_summary}，明细中为{value_in_detail}" if result == "fail" else ""
-```
-确定性判定是首选。它们免费、即时、可复现。能用 Python 解决的判定，绝不调用 LLM。
-## 语义判定（用 LLM）
-规则需要语言理解时使用 LLM。
-### 充分性判定
-"风险披露是否充分描述了主要风险因素。"
-这无法用 Python 判定——"充分"是一个需要理解内容的语义概念。
-LLM 判定提示词设计要点：
-1. 提供规则全文（什么构成合规）。
-2. 提供提取的文档内容（文档实际说了什么）。
-3. 要求结构化输出：pass/fail、推理过程、评论。
-4. 要求保守判定——只在明确不合规时判 fail。真正模糊的情况用 uncertain。
-### 完整性判定
-"管理层讨论与分析是否涵盖了财务状况、经营成果和现金流量三个方面。"
-这是一个清单式的语义判定：内容是否覆盖了规定的多个主题。
-```
-请判定以下管理层讨论与分析是否涵盖以下三个必要主题：
-1. 财务状况分析
-2. 经营成果分析
-3. 现金流量分析
-文档内容：
-{extracted_section}
-对每个主题，判定是否有实质性讨论（不只是提及标题）。
-返回 JSON：
-{
-  "topic_1_covered": true/false,
-  "topic_2_covered": true/false,
-  "topic_3_covered": true/false,
-  "overall": "pass/fail",
-  "comment": "..."
-}
-```
-### 一致性判定
-"执行摘要与详细调查结果是否一致。"
-需要对两段文本进行语义比较，检查是否存在矛盾或遗漏。
-### 模板合规判定
+**Semantic** — adequacy, completeness, consistency, compliance with templates, detecting misleading or suggestive language, assessing whether a description is fair and balanced. These require language understanding — use worker LLM.
-"报告是否按照《XX管理办法》附件一的格式编写。"
+Many real compliance rules require semantic judgment. "The risk disclosure must adequately describe the key risks" cannot be checked with regex or Python. "The contract description must not be misleading or suggestive" requires deep language understanding. Use worker LLM for these without hesitation.
-需要将实际文档结构与模板要求进行逐项对比。
+Some rules combine both: extract a number (deterministic), compare to threshold (deterministic), then assess the explanation if borderline (semantic). The mix depends on the rule.
-## 混合判定
+The right method is whatever achieves accuracy at lowest cost. Simple threshold checks don't need LLM. Semantic assessments don't benefit from Python. Most projects will have a mix — let the nature of each rule determine the method.
-大多数规则实际上需要混合方法。先跑廉价的确定性步骤，必要时再调用 LLM。
+## Output Format
-### 示例：资本充足率核查
-```
-步骤1（正则提取）：提取"资本充足率"对应的数值 → 12.5%
-步骤2（Python判定）：12.5% >= 8.0% → pass
-```
-如果步骤 1 提取失败或置信度低：
-```
-步骤3（LLM提取）：请从以下内容中找出资本充足率的最新值 → 12.50%
-步骤4（Python判定）：12.50% >= 8.0% → pass
-```
-如果值在边界附近（如 8.02%）：
-```
-步骤5（LLM审查）：请确认12.50%是否为最终调整后的资本充足率，而非中间计算值
-```
-这个漏斗保证了：90% 的文档在步骤 2 就完成（零 LLM 成本），只有困难情况才调用 LLM。
-## 输出格式
-每条规则对每份文档的判定结果：
+For each rule × document combination:
 ```json
 {
   "rule_id": "R001",
-  "document": "bank_annual_report_2024.pdf",
-  "result": "pass",
+  "document": "report_2024_q1.pdf",
+  "result": "pass | fail | missing | error | uncertain",
   "extracted_value": "12.5%",
   "expected": ">= 8.0%",
   "comment": "",
@@ -192,112 +37,46 @@ LLM 判定提示词设计要点：
 }
 ```
-### result 取值说明
+**Result values:**
+- **pass**: Entity complies with the rule.
+- **fail**: Entity does not comply. Comment is required.
+- **missing**: The entity could not be found in the document. This is different from fail — the information is absent, not non-compliant.
+- **error**: Something went wrong during extraction or judgment (parsing failure, API error). Needs investigation.
+- **uncertain**: The judgment is ambiguous. May need human review.
-| 值 | 含义 | 评论要求 |
-|---|------|---------|
-| **pass** | 实体合规 | 通常无需评论 |
-| **fail** | 实体不合规 | **必须**附带评论 |
-| **missing** | 实体在文档中未找到 | 注明搜索范围 |
-| **error** | 提取或判定过程出错 | 注明错误类型 |
-| **uncertain** | 判定模糊，需人工审查 | 说明不确定原因 |
+**Design exit criteria first:** Before writing judgment logic for a rule, define the exit conditions: what constitutes pass, what constitutes fail, what triggers escalation to human, how to handle empty/missing values, what value ranges are valid. Explicit exit criteria prevent ambiguous or inconsistent judgment.
-**missing 与 fail 的区别至关重要**：missing 是提取层面的问题（信息不存在），fail 是判定层面的问题（信息存在但不合规）。混淆二者会导致错误的统计和错误的行动方向。
+**Prompt design:** Design prompts for what you want, not against what you don't want. "Don't include reasoning" is less reliable than extracting the verdict from structured output in postprocessing. Use output filtering instead of prompt negation.
-## 评论要求
+**Comments:**
+- Required only when result is `fail`. Skip for `pass` unless the developer user specifically requests pass comments.
+- Be concise and factual: "Capital adequacy ratio is 7.2%, below the regulatory minimum of 8.0%."
+- Do not editorialize: not "This is a serious violation that could result in penalties." Just state the facts.
+- Include the extracted value and the expected value/condition for context.
-评论是给人看的。它应该让审查人员一眼明白问题所在。
+### Lightweight Annotation Markup
-### 好的评论
-```
-"资本充足率为7.2%，低于监管最低要求8.0%。"
-"贷款合同签署日期2024-05-15距申请日期2024-03-01为75天，超过规定的30天期限。"
-"资产负债表不平衡：总资产1,234,567万元，负债+所有者权益为1,234,590万元，差额23万元。"
-"未在风险管理章节中找到流动性风险的专项讨论。"
-```
-### 不好的评论
-```
-"不合规。"  ← 没有具体信息
-"资本充足率不达标，存在重大风险隐患。"  ← 加了主观判断
-"该银行的资本充足率为7.2%，根据银保监会2023年发布的……（长篇大论）"  ← 过于冗长
-```
-### 评论原则
-- **简洁事实**：提取值 + 期望值 + 差异，三句话以内。
-- **仅在 fail 时给出**：pass 的结果不需要评论，除非开发者用户明确要求。
-- **不加主观判断**：不说"严重"、"重大"、"令人担忧"。只陈述事实。
-- **包含关键数值**：让审查人员无需回看原文就能理解问题。
-### 轻量标注格式
-为便于人工审查、节省 token 开销、以及在不同核查轮次之间做 diff 比较，判定结果也可以用紧凑的文本标注格式表达：
+For human review, token-efficient logging, and clean diff comparisons, results can also be expressed in compact text markup:
 ```
 [PASS] capital_adequacy <- 12.5% (>= 8.0%) | conf:0.95 | src:p3-s2
-[FAIL] sign_date_gap <- 75d (<= 30d) | conf:0.90 | src:p1-s4 | note:签署超期45天
-[MISSING] collateral_value | conf:0.60 | note:未在文档中找到担保物估值
-```
-此格式与上述 JSON 格式可无损互转。在以下场景中使用此格式：向开发者用户展示结果以便快速审阅、在演化迭代摘要中记录日志以节省 token、在核查轮次之间计算 diff。参见 `references/output-format.md` 获取完整的格式规范和转换规则。
-## 判定顺序
-有些规则之间存在依赖关系：
-- **条件依赖**：规则 B 只在规则 A 通过时适用。"如果借款人为新客户（规则 A），则需要额外的尽调文件（规则 B）。"
-- **值依赖**：规则 C 使用规则 A 计算的值。"风险加权资本比率（规则 A）决定了所需的拨备水平（规则 C）。"
-- **逻辑依赖**：规则 D 只在规则 A 和 B 都失败时才需要检查。
-在规则目录中标注这些依赖关系。按依赖顺序执行规则。将上游规则的结果作为下游规则的上下文传递。
-### 依赖图示例
-```
-R001（资本充足率提取） → R002（资本充足率阈值判定）
-R003（核心一级资本提取）→ R004（核心一级资本充足率计算）→ R002
-R001 + R005（杠杆率）→ R006（综合评级）
+[FAIL] sign_date_gap <- 75d (<= 30d) | conf:0.90 | src:p1-s4 | note:Signing overdue by 45 days
+[MISSING] collateral_value | conf:0.60 | note:Collateral valuation not found in document
 ```
-如果上游规则结果为 missing 或 error，下游规则也应标记为 error 或 unable_to_judge，而非强行判定。
-## 边缘情况处理
-### 空提取
-实体未找到。默认判定为 **missing**，而非 fail。缺失值是提取层面的问题，不是合规层面的问题。将其反馈给解析和提取步骤，可能需要升级解析器或调整提取策略。
-### 多值冲突
-文档中同一实体出现在多处，且值不一致。
-- 标记为 **uncertain**。
-- 在评论中列出所有找到的值及其来源位置。
-- 如果规则指定了优先来源（如"以审计报告中的数值为准"），使用该来源的值。
-### 条件规则
-"如果贷款金额超过 1000 万元，则需要提供担保。"
-- 先检查条件：贷款金额是否超过 1000 万？
-- 条件不满足 → 规则不适用 → 结果为 pass（或 not_applicable）。
-- 条件满足 → 继续检查后续要求。
+This format is losslessly convertible to and from the JSON format above. Use it when presenting results to the developer user for quick review, logging to evolution iteration summaries where token economy matters, or computing diffs between verification runs. See `references/output-format.md` for the full specification and conversion rules.
-### 否定规则
+## Judgment Ordering
-"文档中不应包含对关联方的担保承诺。"
+Some rules depend on the results of other rules:
+- Rule B might only apply if Rule A passes. "If the borrower is a new customer (Rule A), then additional documentation is required (Rule B)."
+- Rule C might use a value computed by Rule A. "The risk-weighted capital ratio (Rule A) determines the required reserve level (Rule C)."
-搜索"不存在"比搜索"存在"更难。策略：
-- 在文档中搜索关键词（"关联方"+"担保"+"承诺"）。
-- 如果找到匹配，提取上下文送 LLM 确认是否构成实际的担保承诺（可能只是声明"未提供担保"）。
-- 如果没有找到任何匹配，判定 pass，但置信度降低（因为搜索可能不完整）。
+Map these dependencies in the rule catalog. Execute rules in dependency order. Pass upstream results as context to downstream rules.
-### 数值精度问题
+## Handling Edge Cases
-金融数据经常面临精度问题：
-- 报告中写"12.5%"，但实际精确值可能是"12.4997%"。
-- 四舍五入导致的微小差异不应被判为 fail。
-- 在阈值比较中设定合理容差。如"资本充足率 >= 8%"，可设定容差为 0.05%，即 7.95% 以上都不直接判 fail，而是标记为 uncertain 并提请人工审查。
+- **Null extraction**: The entity was not found. Default to `missing`, not `fail`. A missing value is an extraction problem, not a compliance problem.
+- **Multiple values**: The document contains the entity in multiple places with different values. Flag as `uncertain`. Report all found values.
+- **Conditional rules**: "If the loan exceeds 1M, then collateral is required." Check the condition before applying the rule. If the condition is not met, the rule does not apply — result is `pass` (or `not_applicable` if you add that category).
+- **Negative results**: Some rules check for absence. "The document must NOT contain guarantees to related parties." Searching for absence is harder than searching for presence. Be thorough in the search, then be confident in the negative.
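
The result values, the `missing` versus `fail` distinction, and the comment rules added in this file can be illustrated with a minimal Python sketch of one deterministic threshold rule. The helper `judge_minimum`, its parameters, and the `label` argument are hypothetical and not part of kc-beta; the record carries only the fields visible in the hunk above.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def judge_minimum(rule_id: str, document: str, label: str,
                  extracted: Optional[str], minimum: str) -> dict:
    """Judge a 'value must be >= minimum' percentage rule (hypothetical helper)."""
    record = {
        "rule_id": rule_id,
        "document": document,
        "result": "",
        "extracted_value": extracted or "",
        "expected": f">= {minimum}%",
        "comment": "",
    }
    # Null extraction: the information is absent, which is not non-compliance.
    if extracted is None:
        record["result"] = "missing"
        return record
    try:
        # Decimal avoids float rounding surprises at the boundary.
        value = Decimal(extracted.strip().rstrip("%"))
        floor = Decimal(minimum)
    except InvalidOperation:
        # Unparseable value: something went wrong upstream, flag for investigation.
        record["result"] = "error"
        return record
    if value >= floor:
        record["result"] = "pass"  # no comment on pass
    else:
        record["result"] = "fail"
        # Concise and factual: extracted value plus expected condition.
        record["comment"] = (
            f"{label} is {value}%, below the required minimum of {minimum}%."
        )
    return record
```

Calling `judge_minimum("R001", "report_2024_q1.pdf", "Capital adequacy ratio", "7.2%", "8.0")` yields a `fail` record whose comment states only the extracted and expected values, while passing `None` as the extracted value yields `missing` rather than `fail`.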

package/template/skills/zh/meta/document-chunking/SKILL.md ADDED

@@ -0,0 +1,32 @@
+---
+name: document-chunking
+description: >
+  Fast, cheap chunking for processing batches of sample and input documents.
+  Use when you need to split documents into manageable pieces for initial observation,
+  data sensibility checks, or feeding to extraction workflows. Not for production
+  verification chunking — for that, use tree-processing to design a tailored chunking script.
+---
+# Document Chunking
+Split documents into pieces for downstream processing. This is the fast, cheap version — for batch processing of samples and inputs, not for precision verification workflows.
+## Methods
+**Page-level splits** — simplest. Each page is a chunk. Works for most document processing where you need to iterate over content.
+**Fixed-size chunks** — split by character/token count with overlap. Good for search and initial observation. Typical: 2000-4000 chars with 200 char overlap.
+**Header-based splits** — detect section headers and split at boundaries. Preserves semantic units. Use regex patterns for the document's header convention.
+## When to Use What
+Pick the simplest method that serves the task:
+- Batch document observation → page-level
+- Full-text search index → fixed-size with overlap
+- Section-level extraction → header-based
+- Table of contents available → parse TOC for structure
+## Relationship to tree-processing
+This skill is for quick, cheap chunking during exploration and batch processing. When you need production-grade chunking for verification workflows — where the chunking mechanism must be precise, consistent, and coded as a script — use `tree-processing` instead.
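
Two of the methods named in this added file, fixed-size chunks with overlap and header-based splits, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the defaults are picked from the ranges quoted above (2000-4000 chars, 200 char overlap), and the default header pattern assumes markdown-style `#` headers, which may not match a given document's convention.

```python
import re
from typing import List


def chunk_fixed(text: str, size: int = 3000, overlap: int = 200) -> List[str]:
    """Split text into chunks of `size` characters, each sharing `overlap` characters with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def chunk_by_headers(text: str, header_pattern: str = r"^#{1,6} ") -> List[str]:
    """Split text at lines matching `header_pattern`; each chunk starts at a header."""
    starts = [m.start() for m in re.finditer(header_pattern, text, flags=re.MULTILINE)]
    if not starts:
        return [text] if text else []
    if starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first header
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]
```

For documents whose headers follow another convention (for example numbered clauses), only `header_pattern` would need to change; page-level splitting depends on the PDF library in use and is left out of the sketch.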

package/template/skills/zh/meta/document-parsing/SKILL.md CHANGED

@@ -3,166 +3,99 @@ name: document-parsing
 description: Parse source documents into machine-readable text with maximum fidelity. Use when processing any document in Samples/ or Input/ for the first time, when parsed text quality is poor, or when tables and charts need special handling. Covers multi-level parser selection from simple text extraction to OCR and vision models. Also use when a verification rule fails due to parsing issues (garbled text, missing tables, mangled layouts) and the parser needs to be upgraded for that document type.
 ---
-# 文档解析
+# Document Parsing
-解析是核查工作的地基。文本提取有误，后续所有判定都将失去意义。但解析同时也是成本中心——简单文本提取能解决的问题，绝不要动用视觉模型。
+Parsing is the foundation. If the text is wrong, everything downstream is wrong. But parsing is also a cost center — do not use expensive vision models when simple text extraction works.
-## 最小可用解析器原则
+## The Minimum Viable Parser Principle
-从最简单的解析器开始，仅在质量不达标时逐级升级。这不是为了省钱，而是因为简单解析器的失败模式更少、输出更稳定。复杂工具引入的变量越多，排查问题越困难。
+Start with the simplest parser. Escalate only when necessary. This is not about saving money — it is about producing the most reliable output. Simple parsers have fewer failure modes.
-把解析器想象成一架梯子：你需要够到的是那个高度，而不是梯子的最高一级。
+### Level 1: Direct Text Extraction
+- Tool: pdfjs-dist or similar PDF text extraction.
+- When: Well-formed digital PDFs with embedded text. This covers most modern business documents.
+- Output: Raw text with basic structure preserved (paragraphs, basic formatting).
+- Limitations: Tables may come out as messy text. Charts and images are invisible. Scanned PDFs produce nothing.
-### Level 1：直接文本提取
+### Level 2: Provider VLM (Vision Language Model)
+- Tool: VLM models from configured provider (VLM_TIER3 for cheap OCR, VLM_TIER1 for complex interpretation).
+- When: Level 1 produces garbled/incomplete text, scanned PDFs, image-based PDFs.
+- Output: Recognized text from page images, or structured interpretation (table as markdown, chart data as JSON).
+- Calling a provider VLM is more convenient and reliable than deploying local OCR. Use the cheapest VLM tier first; escalate to a more capable tier for complex tables/charts.
-- **工具**：pymupdf（PyMuPDF）或同类 PDF 文本提取库。
-- **适用场景**：内嵌文字层的数字原生 PDF。覆盖绝大多数现代金融文档——年报、招股说明书、贷款合同、监管报告。
-- **输出**：带基础段落结构的原始文本。
-- **局限**：表格可能被拆散为凌乱文本；图表和扫描页无法处理。
-- **成本**：零 API 调用，毫秒级速度。
+### Level 3: MineRU API or Local Tools (Optional)
+- Tool: MineRU API, pdfplumber, or locally deployed OCR — if configured.
+- When: Provider VLM is unavailable or too expensive for batch processing.
+- These are optional fallbacks. Most users will use Level 1 + Level 2.
-这是默认起点。只有当输出质量不合格时才考虑升级。
+## Quality Detection
-### Level 2：版面感知提取
+How to know when to escalate:
-- **工具**：pdfplumber 或同类版面感知解析器。
-- **适用场景**：Level 1 的表格输出混乱、多栏排版文档、表单类文档（贷款申请表、尽调清单等）。
-- **输出**：保留空间布局的文本，支持单元格级别的表格提取。
-- **局限**：仍基于文本层，无法处理扫描件。
-- **典型触发条件**：当 Level 1 提取的财务报表数字与列头错位、合并单元格导致数据串行时，升级到此级别。
+- **Low character count**: The document has pages but extracted text is very short. Likely a scanned PDF.
+- **Garbled text**: Unusual character sequences, encoding errors, or meaningless text patterns.
+- **Missing expected sections**: The table of contents mentions Chapter 5 but no Chapter 5 text was extracted.
+- **Table artifacts**: Columns of numbers without alignment, cell content mixed with headers, or table borders appearing as characters.
+- **Missing numbers in financial tables**: If a financial document's key metrics are not in the extracted text, the tables were probably not parsed.
-### Level 3：OCR 识别
+Write a quick quality check after parsing and before proceeding. If quality is insufficient, escalate to the next parser level.
-- **工具**：`.env` 中 `OCR_MODEL_TIER` 配置的视觉识别模型（PaddleOCR、GLM-4V 等）。
-- **适用场景**：扫描件 PDF、影印版监管文件、历史档案（2010年以前的银行文件很多是扫描件）。
-- **输出**：从图像中识别出的文字。
-- **局限**：速度慢、消耗 API 调用、可能引入识别错误（繁体/简体混淆、表格线干扰等）。
-- **注意事项**：OCR 对中文竖排文本、印章遮盖区域、手写批注的处理能力有限。遇到这些情况要做额外质量检查。
+### Parse Quality Score
-### Level 4：视觉模型解读
+Compute a quality score (0.0 to 1.0) from weighted heuristics to make escalation decisions systematic rather than ad-hoc. A recommended starting framework:
-- **工具**：高能力视觉模型（`OCR_MODEL_TIER1`）。
-- **适用场景**：
-  - 复杂表格：跨页表格、不规则合并单元格、嵌套表头（银行资本充足率报表常见此类结构）。
-  - 图表数据提取：柱状图、折线图、饼图中包含核查所需的关键数值。
-  - 混合排版：文字与图像交织的页面。
-- **输出**：对视觉内容的结构化解读（表格转 markdown、图表数据转 JSON）。
-- **局限**：成本高、速度慢。只在视觉内容确实需要语义理解时使用。
+- **Character density** (weight ~0.3): actual character count / expected characters for the document's page count. A 10-page PDF that yields only 200 characters likely failed.
+- **Garble ratio** (weight ~0.2): fraction of characters that are common CJK/Latin vs control characters, unusual sequences, or encoding artifacts.
+- **Section completeness** (weight ~0.3): if the document has a table of contents, what fraction of TOC entries have matching content in the extracted text?
+- **Table integrity** (weight ~0.2): for financial documents, are key numeric values that should appear in tables actually present in the extracted text?
-## 质量检测
+**Escalation thresholds** (recommended defaults — adjust freely):
+- Score >= 0.7: accept this parser level, proceed to downstream processing.
+- Score 0.4-0.7: escalate to the next parser level, re-parse, re-score.
+- Score < 0.4: skip directly to Level 3 (OCR) or Level 4 (vision) depending on document characteristics.
-解析完成后，不要直接进入下一步。先跑一遍质量检查，判断是否需要升级解析器。
+**Lock-in**: once a parser level produces an acceptable score for a document type, record that level. Do not re-evaluate unless a downstream verification failure is traced back to a parsing issue.
-### 检测指标
+These weights, thresholds, and the scoring approach itself are starting points. The coding agent should design whatever quality assessment works for the specific document types at hand — a simple pass/fail heuristic may be sufficient for some scenarios; a more nuanced scoring function may be needed for others. The important pattern is: **measure quality → compare to threshold → decide whether to escalate**.
-- **字符数过少**：文档有 200 页但提取文本不到 5000 字——大概率是扫描件，Level 1 只拿到了页眉页脚。
-- **乱码检测**：出现大量连续非常用字符、编码错误符号（□、■、?）、或无意义字符序列。常见于编码不匹配或字体嵌入异常的 PDF。
-- **章节缺失**：目录显示有"第五章 风险管理"，但提取文本中找不到对应内容。可能该章节是扫描插页或图片格式。
-- **表格异常**：
-  - 数字列缺少对齐，数值与表头无法对应。
-  - 单元格内容与相邻单元格混合。
-  - 表格线字符（|、+、-）出现在文本中。
-  - 关键财务数据缺失（资本充足率、不良贷款率、净利润等数字在文本中找不到）。
-- **页码断裂**：连续页码中有跳跃，说明某些页面可能未被提取。
+This follows the same tier-transition pattern as model tier selection in `skill-to-workflow`: a quality/accuracy score drives the decision to stay, escalate, or skip tiers.
-### 质量检查流程
+## Table Handling
-```
-解析完成 → 检查字符数 → 检查乱码比例 → 检查章节完整性 → 检查关键表格
-    ↓ 任一项不合格
-升级到下一级解析器 → 重新解析 → 再次检查
-```
+Tables are critical in financial documents (balance sheets, ratio tables, compliance metrics). They deserve special attention:
-在工作流中实现此逻辑时，记录每次升级的原因（哪个指标触发了升级）。这些日志对演进循环有价值。
+1. **Detection**: Identify table regions. Look for grid patterns, consistent column spacing, or explicit table markers.
+2. **Extraction**: Extract cell-by-cell content. Preserve the row-column relationship.
+3. **Reconstruction**: Convert to a structured format (markdown table, JSON array of rows, or CSV).
+4. **Validation**: Spot-check that key values in the reconstructed table match what is visible in the document.
-### 解析质量评分
+When the standard parser fails on tables, try the vision model approach: send the table image (cropped from the PDF page) to a vision model and ask it to produce a markdown table.
-将上述检测指标量化为一个综合评分（0.0–1.0），让升级决策从主观判断变为系统化流程。
+## Chart Handling
-**推荐信号与参考权重：**
-- **字符密度**（~0.3）：实际提取字符数 / 按页数估算的预期字符数。远低于预期说明大量内容未被提取。
-- **乱码比例**（~0.2）：常用字符占比与异常序列占比的对比。编码问题在此暴露。
-- **章节完整性**（~0.3）：目录条目在正文中有对应内容的比例。缺失章节是解析失败的强信号。
-- **表格完整性**（~0.2）：关键数值（如总资产、净利润、资本充足率）在提取文本中是否可检索到。
+Charts (bar charts, line charts, pie charts) occasionally contain data needed for verification:
-**升级阈值（推荐默认值）：**
-- ≥ 0.7：接受当前解析器级别，进入下一步。
-- 0.4–0.7：升级一级解析器，重新解析后再评分。
-- < 0.4：跳过中间级别，直接使用 OCR 或视觉模型。
+- Extract the chart image from the document.
+- Send to a vision model with a prompt: "Extract the data points, labels, and values from this chart. Return as a JSON array."
+- Validate the extracted data against any nearby text or table that might contain the same numbers.
-**锁定机制：** 一旦评分达标，记录当前解析器级别。仅在下游核查失败且回溯至解析质量时重新评估，避免反复试探。
+This is expensive. Only do it when a verification rule specifically requires data from a chart and that data is not available in text elsewhere in the document.
-**重要提示：** 以上权重、阈值和评分方式本身都是起点，不是定论。编程智能体应根据实际文档特征自由调整、增删参数。真正重要的是这个框架——度量质量 → 对比阈值 → 做出决策——而非具体公式。公式会随着业务数据的积累不断演化。
+## Output Format
-这套"评分 → 阈值 → 分级处理"的模式与 `skill-to-workflow` 中的模型层级选择逻辑完全同构。如果你已经理解了模型层级的逐级升级机制，这里的解析器升级遵循相同范式。
+Parsed documents should be saved as clean markdown:
-## 表格处理
+- Preserve the document's heading hierarchy (# Chapter, ## Section, ### Subsection).
+- Preserve lists, numbered or bulleted.
+- Convert tables to markdown table format.
+- Note page boundaries if relevant (some rules reference specific pages).
+- Strip noise: headers, footers, page numbers, watermarks (unless a rule specifically checks for them).
-金融文档的核心信息大量存在于表格中：资产负债表、利润表、资本充足率明细表、贷款五级分类表、关联交易汇总表。表格处理不好，核查就无法开展。
+Save parsed output alongside the original document for reuse across rules.
-### 四步流程
+## Caching
-1. **检测**：识别表格区域。寻找网格模式、一致的列间距、或显式的表格标记。对金融文档而言，数字密集且纵向对齐的区域几乎都是表格。
-2. **提取**：逐单元格提取内容。关键是保持行列关系——第三行第二列的数字必须对应正确的行标题和列标题。
-   - 常见陷阱：合并单元格导致行列错位；跨页表格的表头在第一页、数据在第二页；千分位逗号与单元格分隔符混淆。
-3. **重建**：转换为结构化格式。
-   - 首选 markdown 表格（人可读、LLM 可理解）。
-   - 复杂表格可用 JSON 行数组（便于程序处理）。
-   - 保留原始表头层级（如"期末余额"下分"本期"和"上期"两个子列）。
-4. **验证**：抽检重建后的表格与原文档是否一致。
-   - 选取 3-5 个关键数值，对照原 PDF 页面确认。
-   - 检查行数和列数是否匹配。
-   - 验证合计行是否等于明细行之和（财务报表通常有此约束）。
-### 表格提取失败时
-当 Level 1-2 无法正确提取表格：
-- 从 PDF 中裁剪表格区域的图片。
-- 发送给视觉模型，提示词要求输出 markdown 表格。
-- 对视觉模型的输出做与上述相同的验证步骤。
-不要因为一页表格提取失败就对整份文档使用 Level 4。只对出问题的表格页面升级。
-## 图表处理
-图表（柱状图、折线图、饼图、散点图）偶尔包含核查所需的数据：
-- 从文档中提取图表图片（按页面或按区域裁剪）。
-- 发送给视觉模型，提示词示例：
-  ```
-  请提取此图表中的所有数据点、标签和数值。
-  返回 JSON 数组格式，每个元素包含 label 和 value 字段。
-  如有多个系列，请分别标注系列名称。
-  ```
-- 将提取的数据与文档中其他位置的文本或表格交叉验证——图表的数据通常在正文或附表中也能找到。
-这是高成本操作。只在核查规则明确要求图表中的数据、且该数据无法从文本中获取时才执行。
-## 输出格式
-解析后的文档应保存为干净的 markdown 文件：
-- **保留标题层级**：`# 第一章 总则`、`## 第一节 定义`、`### 一、适用范围`。与原文档的层级结构一一对应。
-- **保留列表**：有序列表和无序列表保持原有编号方式。
-- **表格转换**：转为 markdown 表格格式。复杂表格保留足够的上下文说明。
-- **页码标注**：在页面边界处标注 `<!-- Page X -->`。部分核查规则引用特定页码。
-- **清除噪声**：页眉、页脚、页码、水印一律去除（除非某条规则专门检查这些内容）。
-- **保留原文措辞**：不要改写原文语句。解析是忠实转录，不是翻译或摘要。
-文件命名建议：原文件名加 `.parsed.md` 后缀，存放在同一目录下。
-## 缓存与复用
-解析是耗时操作（尤其 Level 3-4），必须缓存结果以避免重复劳动：
-- 将解析后的 markdown 文件保存在原文件旁边，供所有规则复用。
-- 记录解析器级别：在 markdown 文件开头或配套的元数据文件中注明使用了哪个级别的解析器。
-- 仅在以下情况重新解析：
-  - 原始文件被替换或更新。
-  - 某条规则的核查失败被追溯到解析质量问题，需要升级解析器。
-  - 缓存文件损坏或丢失。
-跨规则共享解析结果是效率的关键。一份 300 页的年报可能被 50 条规则引用——解析一次，使用 50 次。
+Parsing is expensive (especially Level 3-4). Cache parsed output:
+- Store the parsed markdown alongside the original file.
+- Track which parser level produced it.
+- Re-parse only when: the original file changes, a rule requires higher-quality parsing than what is cached, or a verification failure is traced back to a parsing issue.
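
The "measure quality → compare to threshold → decide whether to escalate" pattern from the added Parse Quality Score section can be sketched in Python as below. The weights and thresholds are the recommended defaults quoted in the added text; everything else is an assumption for illustration: the function names, the signal keys, and the `expected_per_page` figure of 1500 characters. The added text still refers to "Level 3 (OCR)" and "Level 4 (vision)" although the new ladder defines three levels, so the sketch returns a generic `skip_ahead` decision rather than naming a target level.

```python
from typing import Dict

# Recommended starting weights from the added text; adjust freely.
WEIGHTS: Dict[str, float] = {
    "char_density": 0.3,  # extracted chars relative to what the page count suggests
    "clean_chars": 0.2,   # share of characters that are not garbled (higher is better)
    "sections": 0.3,      # fraction of TOC entries with matching extracted content
    "tables": 0.2,        # fraction of expected key numeric values that were found
}


def char_density(n_chars: int, n_pages: int, expected_per_page: int = 1500) -> float:
    """Ratio of extracted to expected characters, capped at 1.0 (expected_per_page is an assumption)."""
    if n_pages <= 0:
        return 0.0
    return min(1.0, n_chars / (n_pages * expected_per_page))


def quality_score(signals: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of signals, each clamped to [0, 1]; missing signals count as 0."""
    total = sum(weights.values())
    weighted = sum(w * min(1.0, max(0.0, signals.get(k, 0.0))) for k, w in weights.items())
    return weighted / total


def decide(score: float, accept_at: float = 0.7, escalate_at: float = 0.4) -> str:
    """Map a score to 'accept', 'escalate' (next parser level), or 'skip_ahead'."""
    if score >= accept_at:
        return "accept"
    if score >= escalate_at:
        return "escalate"
    return "skip_ahead"
```

With these assumptions, the 10-page PDF that yields only 200 characters gets a character density of about 0.013, and even with perfectly clean characters the overall score stays near 0.2, which maps to `skip_ahead`.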