kc-beta 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/kc-beta.js +16 -0
- package/package.json +32 -0
- package/src/agent/confidence-scorer.js +120 -0
- package/src/agent/context.js +124 -0
- package/src/agent/corner-case-registry.js +119 -0
- package/src/agent/engine.js +224 -0
- package/src/agent/events.js +27 -0
- package/src/agent/history.js +101 -0
- package/src/agent/llm-client.js +131 -0
- package/src/agent/pipelines/base.js +14 -0
- package/src/agent/pipelines/distillation.js +113 -0
- package/src/agent/pipelines/extraction.js +92 -0
- package/src/agent/pipelines/index.js +23 -0
- package/src/agent/pipelines/initializer.js +163 -0
- package/src/agent/pipelines/production-qc.js +99 -0
- package/src/agent/pipelines/skill-authoring.js +83 -0
- package/src/agent/pipelines/skill-testing.js +111 -0
- package/src/agent/tools/agent-tool.js +100 -0
- package/src/agent/tools/base.js +35 -0
- package/src/agent/tools/dashboard-render.js +146 -0
- package/src/agent/tools/document-parse.js +184 -0
- package/src/agent/tools/document-search.js +111 -0
- package/src/agent/tools/evolution-cycle.js +150 -0
- package/src/agent/tools/qc-sample.js +94 -0
- package/src/agent/tools/registry.js +55 -0
- package/src/agent/tools/rule-catalog.js +113 -0
- package/src/agent/tools/sandbox-exec.js +106 -0
- package/src/agent/tools/tier-downgrade.js +114 -0
- package/src/agent/tools/worker-llm-call.js +109 -0
- package/src/agent/tools/workflow-run.js +138 -0
- package/src/agent/tools/workspace-file.js +122 -0
- package/src/agent/version-manager.js +130 -0
- package/src/agent/workspace.js +82 -0
- package/src/cli/components.js +164 -0
- package/src/cli/index.js +329 -0
- package/src/cli/init.js +80 -0
- package/src/cli/onboard.js +182 -0
- package/src/cli/terminal.js +143 -0
- package/src/config.js +93 -0
- package/template/.env.template +31 -0
- package/template/CLAUDE.md +137 -0
- package/template/Input/.gitkeep +0 -0
- package/template/Output/.gitkeep +0 -0
- package/template/Rules/.gitkeep +0 -0
- package/template/Samples/.gitkeep +0 -0
- package/template/skills/en/meta/compliance-judgment/SKILL.md +114 -0
- package/template/skills/en/meta/compliance-judgment/references/output-format.md +151 -0
- package/template/skills/en/meta/confidence-system/SKILL.md +117 -0
- package/template/skills/en/meta/corner-case-management/SKILL.md +111 -0
- package/template/skills/en/meta/cross-document-verification/SKILL.md +131 -0
- package/template/skills/en/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
- package/template/skills/en/meta/data-sensibility/SKILL.md +115 -0
- package/template/skills/en/meta/document-parsing/SKILL.md +108 -0
- package/template/skills/en/meta/document-parsing/references/parser-catalog.md +40 -0
- package/template/skills/en/meta/entity-extraction/SKILL.md +129 -0
- package/template/skills/en/meta/tree-processing/SKILL.md +103 -0
- package/template/skills/en/meta-meta/bootstrap-workspace/SKILL.md +70 -0
- package/template/skills/en/meta-meta/dashboard-reporting/SKILL.md +106 -0
- package/template/skills/en/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
- package/template/skills/en/meta-meta/evolution-loop/SKILL.md +210 -0
- package/template/skills/en/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
- package/template/skills/en/meta-meta/quality-control/SKILL.md +138 -0
- package/template/skills/en/meta-meta/quality-control/references/qa-layers.md +92 -0
- package/template/skills/en/meta-meta/quality-control/references/sampling-strategies.md +76 -0
- package/template/skills/en/meta-meta/rule-extraction/SKILL.md +100 -0
- package/template/skills/en/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
- package/template/skills/en/meta-meta/rule-graph/SKILL.md +118 -0
- package/template/skills/en/meta-meta/skill-authoring/SKILL.md +108 -0
- package/template/skills/en/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
- package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +150 -0
- package/template/skills/en/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
- package/template/skills/en/meta-meta/task-decomposition/SKILL.md +129 -0
- package/template/skills/en/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
- package/template/skills/en/meta-meta/version-control/SKILL.md +152 -0
- package/template/skills/en/meta-meta/version-control/references/trace-id-spec.md +79 -0
- package/template/skills/en/skill-creator/LICENSE.txt +202 -0
- package/template/skills/en/skill-creator/SKILL.md +479 -0
- package/template/skills/en/skill-creator/agents/analyzer.md +274 -0
- package/template/skills/en/skill-creator/agents/comparator.md +202 -0
- package/template/skills/en/skill-creator/agents/grader.md +223 -0
- package/template/skills/en/skill-creator/assets/eval_review.html +146 -0
- package/template/skills/en/skill-creator/eval-viewer/generate_review.py +471 -0
- package/template/skills/en/skill-creator/eval-viewer/viewer.html +1325 -0
- package/template/skills/en/skill-creator/references/schemas.md +430 -0
- package/template/skills/en/skill-creator/scripts/__init__.py +0 -0
- package/template/skills/en/skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/template/skills/en/skill-creator/scripts/generate_report.py +326 -0
- package/template/skills/en/skill-creator/scripts/improve_description.py +248 -0
- package/template/skills/en/skill-creator/scripts/package_skill.py +136 -0
- package/template/skills/en/skill-creator/scripts/quick_validate.py +103 -0
- package/template/skills/en/skill-creator/scripts/run_eval.py +310 -0
- package/template/skills/en/skill-creator/scripts/run_loop.py +332 -0
- package/template/skills/en/skill-creator/scripts/utils.py +47 -0
- package/template/skills/zh/meta/compliance-judgment/SKILL.md +303 -0
- package/template/skills/zh/meta/compliance-judgment/references/output-format.md +151 -0
- package/template/skills/zh/meta/confidence-system/SKILL.md +228 -0
- package/template/skills/zh/meta/corner-case-management/SKILL.md +235 -0
- package/template/skills/zh/meta/cross-document-verification/SKILL.md +241 -0
- package/template/skills/zh/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
- package/template/skills/zh/meta/data-sensibility/SKILL.md +235 -0
- package/template/skills/zh/meta/document-parsing/SKILL.md +168 -0
- package/template/skills/zh/meta/document-parsing/references/parser-catalog.md +40 -0
- package/template/skills/zh/meta/entity-extraction/SKILL.md +276 -0
- package/template/skills/zh/meta/tree-processing/SKILL.md +233 -0
- package/template/skills/zh/meta-meta/bootstrap-workspace/SKILL.md +147 -0
- package/template/skills/zh/meta-meta/dashboard-reporting/SKILL.md +281 -0
- package/template/skills/zh/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
- package/template/skills/zh/meta-meta/evolution-loop/SKILL.md +302 -0
- package/template/skills/zh/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
- package/template/skills/zh/meta-meta/quality-control/SKILL.md +269 -0
- package/template/skills/zh/meta-meta/quality-control/references/qa-layers.md +92 -0
- package/template/skills/zh/meta-meta/quality-control/references/sampling-strategies.md +76 -0
- package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +208 -0
- package/template/skills/zh/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
- package/template/skills/zh/meta-meta/rule-graph/SKILL.md +203 -0
- package/template/skills/zh/meta-meta/skill-authoring/SKILL.md +235 -0
- package/template/skills/zh/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
- package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +275 -0
- package/template/skills/zh/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
- package/template/skills/zh/meta-meta/task-decomposition/SKILL.md +224 -0
- package/template/skills/zh/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
- package/template/skills/zh/meta-meta/version-control/SKILL.md +284 -0
- package/template/skills/zh/meta-meta/version-control/references/trace-id-spec.md +79 -0
- package/template/skills/zh/skill-creator/LICENSE.txt +202 -0
- package/template/skills/zh/skill-creator/SKILL.md +479 -0
- package/template/skills/zh/skill-creator/agents/analyzer.md +274 -0
- package/template/skills/zh/skill-creator/agents/comparator.md +202 -0
- package/template/skills/zh/skill-creator/agents/grader.md +223 -0
- package/template/skills/zh/skill-creator/assets/eval_review.html +146 -0
- package/template/skills/zh/skill-creator/eval-viewer/generate_review.py +471 -0
- package/template/skills/zh/skill-creator/eval-viewer/viewer.html +1325 -0
- package/template/skills/zh/skill-creator/references/schemas.md +430 -0
- package/template/skills/zh/skill-creator/scripts/__init__.py +0 -0
- package/template/skills/zh/skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/template/skills/zh/skill-creator/scripts/generate_report.py +326 -0
- package/template/skills/zh/skill-creator/scripts/improve_description.py +248 -0
- package/template/skills/zh/skill-creator/scripts/package_skill.py +136 -0
- package/template/skills/zh/skill-creator/scripts/quick_validate.py +103 -0
- package/template/skills/zh/skill-creator/scripts/run_eval.py +310 -0
- package/template/skills/zh/skill-creator/scripts/run_loop.py +332 -0
- package/template/skills/zh/skill-creator/scripts/utils.py +47 -0
|
@@ -0,0 +1,269 @@
---
name: quality-control
description: Design and execute quality control for production verification workflows. Use when workflows are deployed on Input/ documents and results need to be monitored, when designing the QC sampling strategy for a rule, or when evaluating whether monitoring can be reduced. Covers LLM-as-Judge evaluation, adaptive sampling strategies, confidence-based triage, and the transition from active monitoring to stable oversight. Also use when production quality drops and you need to diagnose whether to trigger the evolution loop.
---

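# Quality Monitoring and QC Strategy in Production

## The Role of Quality Monitoring

Once a workflow is deployed to production, it cannot be left unattended. But manually re-checking the verification result for every document is not feasible either, since that would defeat the purpose of automation.

Quality monitoring plays the role of an "observer": it maintains confidence in the system's accuracy with the smallest possible amount of review. When that confidence drops, it raises the alarm immediately and triggers the evolution loop.
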
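## Five-Layer Quality Assurance Architecture

Quality control is not a single activity. It consists of five layers that build on one another, and a higher layer runs only after the lower layers have passed.

| Layer | Name | What is checked | Method |
|------|------|---------|------|
| L1 | Text integrity | Files exist, encoding is correct, source text remains intact after processing | Script (`lint_*`) |
| L2 | Syntax | Output format is valid (JSON/CSV), required fields exist, types are correct | Script (`lint_*`) |
| L3 | Data completeness | Required fields are populated, values are within valid ranges (dates are dates, amounts are positive) | Script (`validate_*`) |
| L4 | Business logic | Cross-field consistency, threshold compliance, sequence plausibility | Script + LLM |
| L5 | Cross-phase | Entities in results match the extraction output, rules match the catalog, workflow output matches the skill's ground truth | Script (`cross_validate_*`) + LLM |

**Core principles:**
- **Fail fast**: If L1 fails (a file is missing), do not run L4 (business logic). Lower layers block higher layers.
- **Code first**: L1-L3 should be pure code, which is cheap and deterministic. LLM review (see below) operates at L4 and L5.
- **Naming convention**: `lint_*` for L1-L2, `validate_*` for L3-L4, `cross_validate_*` for L5.

**QC vs. reflection**: QC finds problems in the output (this skill). Reflection diagnoses the root cause of a problem and fixes it (see `evolution-loop`). QC supplies data to reflection; reflection feeds fixes back into the system.

See `references/qa-layers.md` for the layer specifications and example patterns.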
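
The fail-fast principle can be illustrated with a minimal sketch. The check signature, the `run_layers` helper, and the two sample checks are hypothetical names introduced here for illustration; only the `lint_*` / `validate_*` prefixes come from the skill itself.

```python
from typing import Callable, List, Tuple

# Hypothetical check signature: takes a record, returns (passed, reason).
Check = Callable[[dict], Tuple[bool, str]]

def run_layers(record: dict, layers: List[Tuple[str, List[Check]]]) -> dict:
    """Run QA layers in order and stop at the first layer that fails."""
    report = {"passed": True, "failed_layer": None, "reasons": []}
    for name, checks in layers:
        results = (check(record) for check in checks)
        failures = [reason for ok, reason in results if not ok]
        if failures:
            report.update(passed=False, failed_layer=name, reasons=failures)
            break  # fail fast: higher layers never run
    return report

def lint_has_text(record: dict) -> Tuple[bool, str]:
    # Assumed L1 check: processed text must be non-empty.
    return (bool(record.get("text")), "L1: processed text is empty")

def validate_amount_positive(record: dict) -> Tuple[bool, str]:
    # Assumed L3 check: amount must be a positive number.
    amount = record.get("amount")
    ok = isinstance(amount, (int, float)) and amount > 0
    return (ok, "L3: amount must be a positive number")

LAYERS = [("L1", [lint_has_text]), ("L3", [validate_amount_positive])]
```

Because the loop breaks on the first failing layer, a record with empty text is reported as an L1 failure even if its amount is also invalid, which keeps the cheaper checks in front of the expensive ones.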

## LLM-as-Judge

The core QC mechanism is an independent review of the workflow's output by the coding agent (you) or a designated higher-tier model.

### Judge Verdict Levels

Each verification result receives one of four verdicts:

| Verdict | Meaning | Follow-up |
|------|------|---------|
| **correct** | The workflow's judgment is fully correct, field extraction is accurate, and the conclusion is well supported | None |
| **partial** | The core conclusion is correct but details are off (for example, incomplete field extraction or imprecise annotations) | Log the deviation; fix at low priority |
| **incorrect** | The verification conclusion is wrong (a false negative or a false positive) | Trigger the evolution loop |
| **missing** | The workflow failed to produce a conclusion (abnormal exit, timeout, malformed output) | Check workflow robustness |

### Field-Level Review

In addition to the overall verdict, review each key field individually:

```json
{
  "document_id": "DOC-2025-0042",
  "rule_id": "R001",
  "workflow_verdict": "pass",
  "judge_verdict": "correct",
  "field_review": {
    "invoice_date": {"extracted": "2025-03-15", "judge": "correct"},
    "contract_start": {"extracted": "2025-01-01", "judge": "correct"},
    "contract_end": {"extracted": "2025-12-31", "judge": "correct"}
  },
  "comment": "",
  "reviewed_at": "2025-04-01T16:00:00Z"
}
```

Field-level review helps catch "right for the wrong reasons" cases, where the conclusion happens to be correct but the extraction was wrong. Such hidden defects can cause errors in future cases.
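
## Adaptive Sampling Strategy

Not every document needs review. The sampling rate is adjusted dynamically based on the workflow's track record.

### Sampling Rate Ladder

```
Initial deployment: 100% full review
    ↓ accuracy ≥ threshold for 2 consecutive batches
Early stable period: 50% sampling
    ↓ accuracy ≥ threshold for 3 consecutive batches
Mid stable period: 20% sampling
    ↓ accuracy ≥ threshold for 5 consecutive batches
Long-term steady state: 5-10% sampling
```

**Fallback**: If accuracy drops in any single batch, the sampling rate immediately steps back one level. If it drops in two consecutive batches, the rate returns to 100%.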
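
As a rough sketch of the ladder and its fallback rule, the state machine below tracks consecutive good and bad batches. It makes two assumptions that the text does not fix: an "accuracy drop" is read as a batch below the threshold, and the long-term rate is set to 10% (one point in the 5-10% band). The class and constant names are hypothetical.

```python
LADDER = [1.0, 0.5, 0.2, 0.1]   # 0.1 is an assumed point in the 5-10% band
PROMOTE_AFTER = [2, 3, 5]       # consecutive good batches needed to leave each level

class SamplingLadder:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.level = 0          # start at 100% full review
        self.good_streak = 0
        self.bad_streak = 0

    @property
    def rate(self) -> float:
        return LADDER[self.level]

    def record_batch(self, accuracy: float) -> float:
        """Update the ladder with one batch and return the next sampling rate."""
        if accuracy >= self.threshold:
            self.good_streak += 1
            self.bad_streak = 0
            can_promote = self.level < len(PROMOTE_AFTER)
            if can_promote and self.good_streak >= PROMOTE_AFTER[self.level]:
                self.level += 1
                self.good_streak = 0
        else:
            self.bad_streak += 1
            self.good_streak = 0
            # One bad batch steps back one level; two in a row force full review.
            self.level = 0 if self.bad_streak >= 2 else max(0, self.level - 1)
        return self.rate
```

The streak counters reset on every level change, so each rung has to be earned again after a fallback.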

### MONITOR_FREQUENCY Mapping in .env

```
MONITOR_FREQUENCY=high  → initial deployment, 100% full review
MONITOR_FREQUENCY=mid   → stable period, 50% sampling
MONITOR_FREQUENCY=low   → long-term steady state, 10% sampling
```

This parameter only sets the initial value. Once the system is running, the rate adjusts automatically based on actual performance.
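
### Sampling Method

Sampling is not plain random selection. Stratified sampling is used to ensure coverage:

1. **Stratify by confidence**: review low-confidence cases first
2. **Stratify by document type**: make sure every document type has samples reviewed
3. **Stratify by verdict**: review failing cases first (false positives cost more than false negatives)
4. **Random floor**: even high-confidence cases have some probability of being selected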

## Confidence-Based Triage

The confidence score that the workflow outputs is a key input for triage:

### High Confidence (≥ 0.9)

The workflow is very sure of its judgment.

Handling: spot checks are enough. In the stable period, randomly sample only 5-10% for review.

### Medium Confidence (0.7 - 0.9)

The workflow is fairly sure but not certain.

Handling: raise the sampling rate to at least 30-50%. Watch for systematic bias among mid-confidence cases.

### Low Confidence (< 0.7)

The workflow has no confidence in its judgment.

Handling: review everything. Low-confidence cases are valuable data in their own right: they are often edge cases or new scenarios and can enrich the test set.
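
The three bands can be sketched as a per-result review probability. The band boundaries (0.7 and 0.9) come from the text above; the concrete probabilities of 0.4 and 0.1 are assumed points inside the stated 30-50% and 5-10% ranges, and the function names and the `confidence` field are hypothetical.

```python
import random

def review_probability(confidence: float) -> float:
    """Map a workflow confidence score to the chance of being reviewed."""
    if confidence < 0.7:
        return 1.0   # low confidence: full review
    if confidence < 0.9:
        return 0.4   # medium: assumed midpoint of the 30-50% range
    return 0.1       # high: assumed upper end of the 5-10% range

def select_for_review(results, rng=None):
    """Pick results for QC review; each result is a dict with a 'confidence' key."""
    rng = rng or random.Random()
    return [r for r in results if rng.random() < review_probability(r["confidence"])]
```

Passing a seeded `random.Random` makes the selection reproducible, which helps when a QC run needs to be audited later.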

## Conditions That Trigger the Evolution Loop

Trigger the evolution loop when quality monitoring finds any of the following:

### Accuracy Drop

```
IF current batch accuracy < WORKFLOW_ACCURACY:
    trigger the evolution loop
    reset the sampling rate to 100%
    notify the developer user
```

### New Failure Mode

Even if overall accuracy meets the threshold, a previously unseen type of failure also calls for starting the evolution loop to investigate.

### Confidence Drift

The workflow's average confidence keeps falling even though its verdicts are still correct. This may signal an upcoming accuracy problem and calls for early intervention.

### Distribution Drift

The feature distribution of the input documents changes (new document formats, new business scenarios). Even if current accuracy is unaffected, assess whether test coverage needs to be extended.
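
One possible way to turn these four conditions into a single check is sketched below. Everything about the data shape is an assumption made for illustration: the `mean_confidence`, `failure_types`, and `doc_types` fields, the `qc_signals` name, and the rule that "drift" means three strictly decreasing batch means.

```python
def qc_signals(batch: dict, history: list, threshold: float,
               known_failure_types: set, known_doc_types: set) -> list:
    """Return the names of the conditions that the current batch raises."""
    signals = []
    if batch["accuracy"] < threshold:
        signals.append("accuracy_drop")
    if set(batch.get("failure_types", [])) - known_failure_types:
        signals.append("new_failure_mode")
    # Assumed drift rule: mean confidence fell in each of the last two steps.
    recent = [h["mean_confidence"] for h in history[-2:]] + [batch["mean_confidence"]]
    if len(recent) == 3 and recent[0] > recent[1] > recent[2]:
        signals.append("confidence_drift")
    if set(batch.get("doc_types", [])) - known_doc_types:
        signals.append("distribution_drift")  # prompts a coverage review
    return signals
```

An empty list means the batch raised none of the conditions; any other result names the reasons to record alongside the decision.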

## QC Process for Batch Processing

When the `Input/` directory holds a batch of documents to process, follow this process:

### Processing Steps

```
1. Scan the Input/ directory and count the pending documents by number and type
2. Run the workflow rule by rule
3. Write the results to the Output/ directory
4. Select a sample for QC review according to the current sampling rate
5. Aggregate the review results
6. Decide whether the evolution loop needs to be triggered
7. Generate the QC report
```

### Output Structure

```
Output/
├── DOC-001/
│   ├── results.json       # verification results for all rules
│   └── qc_review.json     # QC review result (if sampled)
├── DOC-002/
│   └── results.json
└── batch_summary.json     # summary for this batch
```

### batch_summary.json

```json
{
  "batch_id": "BATCH-2025-04-01",
  "total_documents": 50,
  "processed": 50,
  "rules_applied": ["R001", "R002", "R003"],
  "qc_sample_size": 25,
  "qc_sample_rate": 0.5,
  "qc_results": {
    "correct": 23,
    "partial": 1,
    "incorrect": 1,
    "missing": 0
  },
  "accuracy": 0.92,
  "threshold": 0.85,
  "status": "above_threshold",
  "action": "none",
  "timestamp": "2025-04-01T18:00:00Z"
}
```
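
A minimal sketch of how such a summary might be computed from the per-case reviews follows. It assumes, based on the example above (23 correct out of 25 sampled gives 0.92), that accuracy counts only `correct` verdicts. The `summarize_batch` name and the `below_threshold` / `trigger_evolution_loop` values are hypothetical; the example only shows `above_threshold` and `none`.

```python
from collections import Counter

VERDICTS = ("correct", "partial", "incorrect", "missing")

def summarize_batch(batch_id, total_documents, reviews, threshold):
    """Aggregate judge verdicts into the core fields of batch_summary.json."""
    counts = Counter(r["judge_verdict"] for r in reviews)
    sample = len(reviews)
    # Assumption: only "correct" counts toward accuracy, as in the example.
    accuracy = counts["correct"] / sample if sample else None
    above = accuracy is not None and accuracy >= threshold
    return {
        "batch_id": batch_id,
        "total_documents": total_documents,
        "qc_sample_size": sample,
        "qc_sample_rate": sample / total_documents if total_documents else 0.0,
        "qc_results": {v: counts[v] for v in VERDICTS},
        "accuracy": accuracy,
        "threshold": threshold,
        "status": "above_threshold" if above else "below_threshold",
        "action": "none" if above else "trigger_evolution_loop",
    }
```

Whether `partial` should count toward accuracy is a policy decision worth confirming with the developer user before relying on this number.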

## QC Logs

All QC activity is recorded under the `logs/qc/` directory:

```
logs/qc/
├── qc_2025-04-01.json     # daily QC summary
├── reviews/               # per-case review records
│   ├── DOC-001_R001.json
│   └── DOC-001_R002.json
└── trends.json            # accuracy trend data
```

### trends.json

Tracks the time series of key metrics for display on the dashboard:

```json
{
  "R001": {
    "history": [
      {"date": "2025-03-28", "accuracy": 0.88, "sample_rate": 1.0, "batch_size": 20},
      {"date": "2025-03-29", "accuracy": 0.90, "sample_rate": 1.0, "batch_size": 15},
      {"date": "2025-03-30", "accuracy": 0.93, "sample_rate": 0.5, "batch_size": 30},
      {"date": "2025-04-01", "accuracy": 0.92, "sample_rate": 0.5, "batch_size": 50}
    ]
  }
}
```
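
## Developer User Involvement

Quality monitoring should not require the developer user to read JSON files. Visual reports are generated through the dashboard skill, so the developer user only needs to look at:

- The current accuracy status of each rule (green/yellow/red)
- Whether any ambiguous cases need human confirmation
- Whether the cost trend is reasonable
- Whether any escalated items need a decision

Problems found by QC that the system can fix on its own (through the evolution loop) do not need to involve the developer user. Escalate only in the following cases:

- Accuracy keeps falling and the evolution loop has failed to fix it
- A new business scenario has appeared and its verification rules need confirmation
- Costs fluctuate abnormally
- A problem remains unresolved after reaching `MAX_ITERATIONS`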
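
## User Feedback Collection

Build error reporting and commenting into every verification app you create. These are not optional features; they are essential data sources.

### Two Audiences

- **Developer users**: technical error reports, such as field-level corrections, requests to re-evaluate a rule, and false-positive/false-negative flags with context. They can see the full result details.
- **End users**: simplified feedback, such as marking a result as wrong, adding a comment, and rating severity. They see a clean interface with no technical internals.

### Feedback as Ground Truth

When a user reports an error in a verification result, that correction is ground truth. It takes precedence over the coding agent's judgment and the Worker LLM's output. Inject user corrections into `evolution-loop` immediately as confirmed failure cases; they take priority over problems the agent detects on its own.

### Feedback Data Flow

Collect feedback through the dashboard (see `dashboard-reporting`) → store it as structured records (result_id, reporter_role, feedback_type, corrected_value, comment, timestamp) → inject it into `evolution-loop` as regression test cases → track correction trends in the QC metrics.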
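
The structured record named in the data flow could be sketched as a small dataclass. The six field names come from the text; the example values for `reporter_role` and `feedback_type`, the `to_regression_case` helper, and the shape of the regression case are assumptions for illustration only.

```python
from dataclasses import dataclass, asdict

@dataclass
class FeedbackRecord:
    result_id: str
    reporter_role: str     # e.g. "developer" or "end_user" (assumed values)
    feedback_type: str     # e.g. "false_positive" (assumed value)
    corrected_value: str
    comment: str
    timestamp: str         # ISO 8601 string

def to_regression_case(fb: FeedbackRecord) -> dict:
    """Turn a user correction into a hypothetical regression test case."""
    return {
        "result_id": fb.result_id,
        "expected": fb.corrected_value,   # user corrections are ground truth
        "source": "user_feedback",
        "priority": "high",               # ranks above agent-detected issues
    }
```

`asdict(record)` yields a plain dict, which is convenient when the record has to be written out as JSON.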
|
|
@@ -0,0 +1,92 @@
# QA Layer Specifications

Detailed specifications for the five-layer QA architecture. Each layer builds on the one below it.

## Layer Details

### L1: Text Integrity

- **Description**: Verify that source files exist, are readable, and that text content is preserved correctly after any processing (parsing, OCR, conversion).
- **Input**: Raw document files and their processed text output.
- **Output**: Pass/fail per file with error details.
- **Example checks**: File exists and is non-empty. Encoding is UTF-8 (or declared encoding). No null bytes in text output. Character count is within expected range for document type.
- **Common failures**: File path changed after processing. OCR produced empty output. Encoding mismatch causes garbled characters.
- **Escalation**: If L1 fails, do not proceed to higher layers. Log the failure and flag for reprocessing.

### L2: Syntax

- **Description**: Verify that output files conform to their declared format and schema.
- **Input**: Output files (JSON, CSV, etc.) from workflows.
- **Output**: Pass/fail per file with parse errors or schema violations.
- **Example checks**: JSON is valid (parses without error). Required top-level keys exist. Array fields are arrays, not strings. Date fields match ISO 8601 format.
- **Common failures**: Trailing comma in JSON. Missing closing bracket. CSV with inconsistent column count. Unexpected null where value is required.
- **Escalation**: Syntax failures indicate a bug in the output generation code. Fix the code, not the data.

### L3: Data Completeness

- **Description**: Verify that required data fields are populated with values in their valid domain.
- **Input**: Parsed output records.
- **Output**: Per-field validation results with reasons for any failures.
- **Example checks**: Invoice date is a valid date (not "N/A" or empty). Amount is a positive number. Entity name is non-empty and does not contain only whitespace. Enum fields contain allowed values.
- **Common failures**: Extraction returned "unable to determine" as a value. Amount includes currency symbol (string instead of number). Date extracted as partial (month and day but no year).
- **Escalation**: Completeness failures feed back to extraction prompt improvement. If a field is consistently incomplete, the extraction logic needs work.

### L4: Business Logic

- **Description**: Verify cross-field consistency and compliance with business rules.
- **Input**: Complete, validated records from L3.
- **Output**: Per-rule validation results with reasoning.
- **Example checks**: Contract end date is after start date. Invoice date falls within contract validity period. Total amount equals sum of line items. Signatory name matches authorized personnel list.
- **Common failures**: Date comparison fails due to timezone differences. Rounding errors in amount calculations. Cross-reference lookup fails because entity names differ slightly (e.g., "ABC Corp" vs "ABC Corporation").
- **Escalation**: Business logic failures may indicate rule misunderstanding. Consult the developer user if the rule intent is ambiguous.

### L5: Cross-Phase

- **Description**: Verify consistency across different phases of the verification pipeline.
- **Input**: Outputs from multiple pipeline stages (extraction, verification, reporting).
- **Output**: Cross-phase consistency report.
- **Example checks**: Entities in final results match those in extraction output (nothing added or dropped). Rule IDs in results exist in the rule catalog. Workflow output for a skill matches the skill's own ground truth output. Confidence scores in results match those computed by the confidence system.
- **Common failures**: A rule was added to the catalog but the workflow was not updated to include it. Extraction found 5 entities but results only report 4. Workflow output diverges from skill ground truth on edge cases.
- **Escalation**: Cross-phase failures often indicate integration issues. Check the pipeline connections, not individual components.

## Script Naming Convention

| Prefix | Layer | Purpose | Examples |
|--------|-------|---------|----------|
| `lint_` | L1-L2 | Fast, syntactic checks | `lint_json.py`, `lint_encoding.py`, `lint_schema.py` |
| `validate_` | L3-L4 | Domain and logic validation | `validate_fields.py`, `validate_dates.py`, `validate_amounts.py` |
| `cross_validate_` | L5 | Cross-phase consistency | `cross_validate_extraction.py`, `cross_validate_rules.py` |

Scripts should:
- Accept a file or directory path as input.
- Output structured JSON results (pass/fail per check, with reasons).
- Return exit code 0 if all checks pass, non-zero otherwise.
- Be idempotent: running twice produces the same result.
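
A script that follows these four conventions might look like the sketch below. The file name `lint_json.py` appears in the table above, but its contents are not given in the source, so the function names and the result fields (`file`, `check`, `passed`, `reason`) are assumptions.

```python
import json
import sys
from pathlib import Path

def lint_json_file(path: Path) -> dict:
    """L2 check: the file must parse as JSON."""
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return {"file": str(path), "check": "json_parses", "passed": True, "reason": ""}
    except (OSError, UnicodeDecodeError, json.JSONDecodeError) as exc:
        return {"file": str(path), "check": "json_parses", "passed": False, "reason": str(exc)}

def main(argv: list) -> int:
    """Accept a file or directory path, print JSON results, return the exit code."""
    target = Path(argv[1])
    files = sorted(target.rglob("*.json")) if target.is_dir() else [target]
    results = [lint_json_file(f) for f in files]
    print(json.dumps(results, ensure_ascii=False, indent=2))
    return 0 if all(r["passed"] for r in results) else 1

# A real script would end with: sys.exit(main(sys.argv))
```

The checks only read files and never modify them, so running the script twice on the same input gives the same output, which satisfies the idempotency requirement.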

## QC vs Reflection

| Dimension | QC (this skill) | Reflection (evolution-loop) |
|-----------|-----------------|---------------------------|
| **Who runs it** | Coding agent or automated scripts | Coding agent |
| **What triggers it** | Every batch, on schedule | QC failures, accuracy drops |
| **Input** | Workflow outputs | QC reports, failure logs, iteration history |
| **Output** | Pass/fail verdicts, accuracy metrics | Root cause diagnosis, fix proposals |
| **Cost** | Low (mostly scripts, some LLM at L4-L5) | Higher (deep analysis, prompt rewriting) |
| **When to use** | Always, for every production batch | Only when QC reveals problems |
| **Goal** | Detect problems | Fix problems |

QC without Reflection detects issues but cannot fix them. Reflection without QC has no data to work from. They are complementary, not alternatives.

## Integration Points

### With `data-sensibility`

The `data-sensibility` skill provides input validation that feeds L1-L3. If data-sensibility checks flag a document as anomalous before processing, QC can prioritize reviewing that document's outputs. Data-sensibility operates on inputs; QC operates on outputs. Together they bracket the pipeline.

### With `cross-document-verification`

Cross-document verification enables L5 cross-doc consistency checks. When multiple documents reference the same entity (e.g., same contract number across invoice and purchase order), L5 can verify that extracted values are consistent across documents. Without cross-document verification, L5 is limited to single-document cross-phase checks.

### With `confidence-system`

QC results calibrate the confidence system. When QC reveals that high-confidence results are sometimes wrong, the confidence thresholds need adjustment. Conversely, confidence scores drive QC sampling: low-confidence results get more review. This creates a feedback loop: QC improves confidence calibration, better calibration improves QC efficiency.
|
|
@@ -0,0 +1,76 @@
# Sampling Strategies for Quality Control

## Adaptive Sampling

The core idea: review more when you are uncertain, less when you are confident. Confidence grows with evidence, meaning consecutive batches of high accuracy.

### Continuous Decay Model

Rather than cliff-edge transitions between phases, use a smooth exponential decay driven by observed accuracy:

```
sampling_rate = max(floor_rate, exp(-λ × consecutive_successes))
```

Where:
- `consecutive_successes`: number of consecutive batches where accuracy meets or exceeds the threshold. **Resets to 0** whenever a batch's accuracy drops below the threshold. This is the self-correcting mechanism: quality drops immediately increase monitoring.
- `λ` (decay speed): controlled by MONITOR_FREQUENCY in `.env`.
- `floor_rate`: the minimum sampling rate, never goes below this.

### MONITOR_FREQUENCY Mapping

| Setting | λ | floor_rate | Character |
|---------|---|------------|-----------|
| `high` | 0.1 | 0.10 | Slow decay, cautious: for high-stakes verification where errors are costly |
| `mid` | 0.2 | 0.05 | Balanced decay: standard for most scenarios |
| `low` | 0.3 | 0.05 | Fast decay: for well-understood domains with simple rules |

As a rough mental model of the curve shape (for `mid`):
- After 1 success: ~82% sampling
- After 3 successes: ~55%
- After 5 successes: ~37%
- After 10 successes: ~14%
- After 15 successes: ~5% (floor)

These numbers, the formula, and even the exponential shape are recommended defaults. The coding agent and developer user should discuss and calibrate based on the specific business scenario. If a different decay function (linear, sigmoid, or hand-tuned) works better, use it. The framework (accuracy-driven decay with reset on quality drop) matters more than the specific formula.
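
The decay formula and the mapping table translate directly into a few lines of Python. This is a sketch of the recommended defaults only; the `FREQUENCY_PARAMS`, `sampling_rate`, and `update_streak` names are introduced here for illustration.

```python
import math

# (λ, floor_rate) per MONITOR_FREQUENCY setting, taken from the table above.
FREQUENCY_PARAMS = {"high": (0.1, 0.10), "mid": (0.2, 0.05), "low": (0.3, 0.05)}

def sampling_rate(consecutive_successes: int, monitor_frequency: str = "mid") -> float:
    """sampling_rate = max(floor_rate, exp(-λ × consecutive_successes))."""
    lam, floor_rate = FREQUENCY_PARAMS[monitor_frequency]
    return max(floor_rate, math.exp(-lam * consecutive_successes))

def update_streak(streak: int, accuracy: float, threshold: float) -> int:
    """Extend the streak on a good batch; reset to 0 on any batch below threshold."""
    return streak + 1 if accuracy >= threshold else 0
```

With `mid`, `sampling_rate(5)` is `exp(-1)`, about 0.37, which matches the curve listed above; a single batch below the threshold resets the streak, so the next rate is back at 1.0.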

## Priority Sampling

Not all results are equally worth reviewing. Priority sampling ensures that the most informative results are always in the review set:

### Always Review
- Results where the workflow reported low confidence (below the full-review threshold from `confidence-system`).
- Results where the workflow produced an error or missing result.
- Results from document types not seen during skill/workflow testing.

### Usually Review
- Results where the workflow's confidence is in the medium band.
- Results from rules that historically have lower accuracy.
- Results from the first occurrence of a new document format or variant.

### Spot-Check
- Results with high confidence from rules that historically have high accuracy.
- These are selected randomly from the high-confidence pool.
- The purpose is regression detection, not active improvement.

## Stratified Sampling

When documents vary significantly in complexity or type, stratify the sample:

1. **Group documents** by type, complexity, or any relevant characteristic.
2. **Sample proportionally** from each group, ensuring that minority groups are represented.
3. **Over-sample** from groups that historically have lower accuracy.

This prevents the random sample from being dominated by easy documents while missing systematic failures in hard documents.
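
The three stratification steps can be sketched as follows. The `doc_type` key, the `boost` multipliers for historically weaker groups, and the "at least one per group" rule are assumptions chosen to illustrate the idea, not values from the skill.

```python
import math
import random
from collections import defaultdict

def stratified_sample(docs, rate, key="doc_type", boost=None, rng=None):
    """Sample each group at `rate`, scaled by an optional per-group boost."""
    rng = rng or random.Random()
    boost = boost or {}
    groups = defaultdict(list)
    for d in docs:                      # step 1: group documents
        groups[d[key]].append(d)
    sample = []
    for name, members in groups.items():
        # step 3: over-sample weaker groups, capped at 100%
        group_rate = min(1.0, rate * boost.get(name, 1.0))
        # step 2: proportional share, with at least one document per group
        k = max(1, math.ceil(group_rate * len(members)))
        sample.extend(rng.sample(members, k))
    return sample
```

The `max(1, ...)` floor is what keeps minority groups represented even when the overall rate is very low.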

## Confidence Calibration Check

Periodically (every N batches), run a calibration check:

1. Take a random sample of high-confidence results.
2. Review them (LLM-as-Judge or human).
3. Compare: are 90%+ of "high confidence" results actually correct?
4. If not, the confidence system needs recalibration (see `confidence-system` skill).
5. If yes, you can safely reduce the sampling rate for high-confidence results.

This is a meta-check on the quality of the quality control system itself.
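
Steps 1 to 3 of the calibration check reduce to a small computation over reviewed results. The sketch assumes each reviewed result carries a `confidence` score and a `judge_verdict`, and that "high confidence" means a score of at least 0.9; both the field names and the `calibration_check` helper are hypothetical.

```python
def calibration_check(reviews, high_conf=0.9, target=0.9):
    """Check whether high-confidence results are correct at least `target` of the time."""
    high = [r for r in reviews if r["confidence"] >= high_conf]
    if not high:
        return {"n": 0, "observed_accuracy": None, "calibrated": None}
    correct = sum(1 for r in high if r["judge_verdict"] == "correct")
    observed = correct / len(high)
    return {"n": len(high), "observed_accuracy": observed, "calibrated": observed >= target}
```

A `calibrated` value of `False` corresponds to step 4 (recalibrate), and `True` corresponds to step 5 (the sampling rate for high-confidence results can be reduced).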
|
|
@@ -0,0 +1,208 @@
---
name: rule-extraction
description: Extract and organize business verification rules from regulation documents into discrete, testable units. Use when processing documents in Rules/ to identify individual verification rules, when decomposing a regulation into atomic checks, or when the developer user adds new regulation files. Covers reading regulation text, identifying rule boundaries, determining granularity, handling cross-references, and producing a rule catalog. Also use when rules are provided in structured formats like xlsx or csv.
---

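# Deconstructing Regulatory Text and Extracting Verification Rules

## Core Idea

A rule is the atomic unit of the whole verification system. One rule maps to one skill folder, and one skill folder maps to one independently testable piece of verification logic. The quality of rule extraction sets the ceiling for every later stage: skill authoring, workflow distillation, and quality monitoring are all built on it.

Extract well and everything downstream takes half the effort. Extract poorly and everything downstream is reworked again and again.

## Four Traits of a High-Quality Rule

### Atomicity

A rule does exactly one thing. If you find that a rule needs "and" or "at the same time" to join two independent judgments, it should most likely be split into two rules.

Counterexample: "The invoice date must fall within the contract validity period, and the invoice amount must not exceed the contract total." This is two rules.

### Testability

A rule's verdict must be unambiguous: pass, fail, or cannot be determined. Vague conclusions such as "roughly reasonable" or "mostly compliant" are not allowed. If a rule cannot yield a definite conclusion, it has not been extracted properly yet.

### Self-Containment

Executing a rule should not depend on the result of executing another rule. Every rule should be able to run on its own. If judging rule A requires first knowing the result of rule B, the two are coupled and need to be redesigned.

Exception: a cross-validation rule (such as "the invoice amount matches the contract amount") is itself an independent rule; it depends on data fields, not on the conclusions of other rules.

### Clear Scope

A rule must state clearly which document types, business scenarios, and preconditions it applies to. Rules with a vague scope produce many misjudgments in real verification.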
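
## System-Level Design Principles for the Rule Set

Each individual rule should have the four characteristics above. The rule catalog as a whole must also satisfy the following system-level properties:

### Coverage Target
The extracted rules should cover at least 95% of the regulation's verifiable requirements. After the first extraction pass, run a coverage audit: read the original regulation end to end and mark whether each paragraph is covered by at least one rule. An uncovered paragraph is either non-verifiable content (definitions, background) or a gap that needs a new rule.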
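
Once each paragraph has been marked, the coverage audit reduces to simple bookkeeping. The sketch below assumes a hypothetical data shape: each paragraph carries a `verifiable` flag set during the read-through, and each rule lists the paragraph IDs it was extracted from in `source_paragraphs`. Neither field name comes from this skill.

```python
def coverage_audit(paragraphs, rules, target=0.95):
    """Compute coverage of verifiable paragraphs and list the gaps.

    paragraphs: list of {"id": str, "verifiable": bool}   (assumed shape)
    rules:      list of {"id": str, "source_paragraphs": [paragraph ids]}
    """
    covered_ids = {pid for rule in rules for pid in rule["source_paragraphs"]}
    # Definitions and background are excluded from the denominator.
    verifiable = [p["id"] for p in paragraphs if p["verifiable"]]
    gaps = [pid for pid in verifiable if pid not in covered_ids]
    if verifiable:
        coverage = (len(verifiable) - len(gaps)) / len(verifiable)
    else:
        coverage = 1.0
    return {"coverage": coverage, "gaps": gaps, "meets_target": coverage >= target}
```

The `gaps` list is the useful output: each entry is a verifiable paragraph that still needs a rule.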
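
### Atomicity Test
One rule = one pass/fail conclusion. If a rule can yield two independent pass/fail results, it should be split into two rules. Ask yourself: "Can this rule partially pass?" If it can, keep splitting.

### Ambiguity Minimization
No two rules may give contradictory conclusions about the same entity in the same document. After extraction, review every pair of rules whose scopes overlap. If rule A says pass and rule B says fail for the same entity, their scope boundaries are unclear and must be fixed.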
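
Candidate conflicts can be surfaced mechanically from test-run results before reviewing them by hand. The following sketch assumes results arrive as `(rule_id, document_id, entity, verdict)` tuples, which is a format invented for this example. It only flags pairs for review; deciding whether the scopes really overlap remains a judgment call.

```python
from collections import defaultdict

def find_conflicts(results):
    """Group verdicts by (document, entity) and report pass/fail disagreements.

    results: iterable of (rule_id, document_id, entity, verdict) tuples,
             where verdict is "pass" or "fail" (assumed format).
    """
    by_target = defaultdict(dict)
    for rule_id, document_id, entity, verdict in results:
        by_target[(document_id, entity)][rule_id] = verdict

    conflicts = []
    for target, verdicts in by_target.items():
        passed = sorted(r for r, v in verdicts.items() if v == "pass")
        failed = sorted(r for r, v in verdicts.items() if v == "fail")
        # A target judged both ways points to unclear scope boundaries.
        if passed and failed:
            conflicts.append({"target": target, "pass": passed, "fail": failed})
    return conflicts
```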
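
### Downstream Anticipation
Rules are eventually distilled into workflows (see `skill-to-workflow`). Design for distillation from the start: clear input/output boundaries, explicit decision criteria, and as little reliance on implicit domain knowledge as possible. If a rule requires "reading between the lines", write that interpretation out explicitly. Use `task-decomposition` to identify the natural boundaries between rules.

### Catalog Versioning
When rules change (added, modified, or deprecated), version the rule catalog as a whole. A single rule's version tracks that specific rule; the catalog version tracks the consistent state of the rule set. Record the catalog version in `versions.json`, alongside the individual rule versions.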
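
The layout of `versions.json` is not specified here, so the sketch below is purely illustrative. It assumes a top-level integer `catalog` version next to a `rules` mapping of per-rule integer versions, and shows the two counters moving together: any change set bumps the affected rules and the catalog exactly once.

```python
def bump_catalog_version(versions, changed_rules):
    """Return a new versions mapping with rule and catalog versions bumped.

    versions:      {"catalog": int, "rules": {rule_id: int}}  (assumed layout)
    changed_rules: rule ids added, modified, or deprecated in this change set
    """
    if not changed_rules:
        return versions  # nothing changed, so the catalog state is the same
    rules = dict(versions["rules"])
    for rule_id in changed_rules:
        rules[rule_id] = rules.get(rule_id, 0) + 1  # new rules start at 1
    # One catalog bump per change set, however many rules it touched.
    return {"catalog": versions["catalog"] + 1, "rules": rules}
```

Returning a new mapping instead of mutating the input keeps the previous catalog state available for comparison.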
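
## Strategy 1: Structured Input (the Developer User Provides a Rule Table)

When the developer user provides the rule list as xlsx, csv, or another structured format, this is the ideal case.

### Steps

1. Read the file and understand the table structure (column names, grouping)
2. Respect the developer user's rule boundaries: they know the business better than you do
3. Assign each rule a standard ID (R001, R002, ...)
4. Check for implicit compound rules that need further splitting
5. Fill in information the table may be missing: scope, preconditions, decision criteria

### Caveats

- Do not merge rules the developer user has already split
- If a rule's description in the table is too general, mark it "needs refinement" (「待细化」) and confirm with the developer user
- Keep references back to the original table (such as row numbers) for traceability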
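
For a csv input, steps 1 and 3 and the row-number caveat can be sketched as follows. The column name `description` and the `source_row` field are assumptions made for illustration; a real table will have its own columns. The sketch can only flag an empty description mechanically; judging whether a non-empty one is too general still requires reading it.

```python
import csv
import io

def load_rule_table(csv_text, description_column="description"):
    """Turn a rule table into catalog entries, one entry per table row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    entries = []
    for index, row in enumerate(reader, start=1):
        description = (row.get(description_column) or "").strip()
        entries.append({
            "id": f"R{index:03d}",          # R001, R002, ...
            "name": description,
            "source_row": index + 1,        # spreadsheet row; row 1 is the header
            "status": "extracted",
            "notes": "" if description else "needs refinement",
        })
    return entries
```

Each row becomes exactly one entry, so rules the developer user has already split are never merged.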
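
## Strategy 2: Peeling the Regulation Text Layer by Layer

When the input is unstructured text such as regulations, regulatory notices, or internal policies, extract layer by layer using the "onion-peeling method" (「洋葱剥皮法」).

### Layer 1: Survey the Overall Structure

Read quickly and identify how the document is organized:
- The numbering scheme (Article X, Paragraph X, Item X; in Chinese regulations, 第X条、第X款、第X项)
- Which sections are definitional (no verification rules)
- Which sections state substantive requirements (contain verification rules)
- Which sections are procedural (approval workflows, which may contain deadline rules)
- Whether supplementary provisions or annexes contain additional rules

### Layer 2: Identify Rule-Bearing Paragraphs

Focus on the substantive sections and judge each paragraph:
- Does it contain normative wording such as "shall" (「应当」), "must" (「必须」), "must not" (「不得」), or "need to" (「需要」)?
- Does it describe a specific requirement that can be verified?
- Is it a narrative explanation or an operational requirement?

Only paragraphs that contain verifiable requirements move on to the next layer.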
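
The first of the three questions above, normative wording, lends itself to a cheap pre-filter. The sketch below scans for the four Chinese markers listed in this section; the function name and the sample paragraphs in the usage note are made up for illustration. A keyword hit only nominates a paragraph; the other two questions still need a reader's judgment.

```python
NORMATIVE_MARKERS = ("应当", "必须", "不得", "需要")

def rule_bearing_candidates(paragraphs):
    """Return (index, paragraph, markers found) for paragraphs with normative wording."""
    candidates = []
    for index, text in enumerate(paragraphs):
        found = [marker for marker in NORMATIVE_MARKERS if marker in text]
        if found:
            candidates.append((index, text, found))
    return candidates
```

For example, given a definitional paragraph followed by "开具发票应当加盖发票专用章。", only the second is returned, with `["应当"]` as the marker found.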
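
### Layer 3: Extract Rules Paragraph by Paragraph

For each rule-bearing paragraph:

1. Extract the verification target (which field, which document)
2. Extract the verification criterion (what condition, what threshold)
3. Extract the scope (which business scenario, which preconditions)
4. Extract the exceptions (when the rule does not apply)
5. Record the source location (regulation name + article number)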
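
The five extraction steps map naturally onto a record with five slots. The dataclass below is one possible shape, not a format defined by this skill; all field names are hypothetical. The `missing_fields` helper shows how an incomplete extraction could be caught before it reaches the catalog.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ExtractedRule:
    rule_id: str
    target: str       # step 1: which field, which document
    criterion: str    # step 2: what condition, what threshold
    scope: str        # step 3: business scenario and preconditions
    source: str       # step 5: regulation name + article number
    exceptions: list = field(default_factory=list)  # step 4: may be empty

    def missing_fields(self):
        """Names of required text fields that are still empty."""
        required = ("target", "criterion", "scope", "source")
        return [name for name in required if not getattr(self, name).strip()]
```

Exceptions default to an empty list because a rule may have none, while the other four slots are required. `asdict(rule)` converts a record to a plain dict for serialization.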
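
### Layer 4: Resolve Cross-References

Common cross-reference patterns in regulatory text:
- "Pursuant to Article X of these Measures" (「依据本办法第X条规定」): find the referenced article and merge in its context
- "Handled with reference to regulation XX" (「参照XX法规执行」): check whether the external regulation is in `Rules/`; if not, mark it as pending
- "The circumstances described in the preceding paragraph" (「前款所述情形」): trace back through the text and make the referent explicit

Principle: resolve each cross-reference into a self-contained rule description, and keep the original reference relationship in the rule's `references/`.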
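
Locating the three reference patterns can be automated with regular expressions; resolving them still means reading the referenced text. The sketch below makes two assumptions: external regulation titles appear inside 《》 brackets, and `available_regulations` is a set of titles already present in `Rules/`. Both are illustrative choices, and real regulations will need broader patterns.

```python
import re

CN_NUM = "一二三四五六七八九十百零〇0-9"
PATTERNS = {
    "internal_article": re.compile(rf"第([{CN_NUM}]+)条"),
    "external_regulation": re.compile(r"参照《([^》]+)》"),
    "preceding_paragraph": re.compile(r"前款"),
}

def find_cross_references(text, available_regulations=()):
    """List cross-references found in a clause, grouped by pattern kind."""
    refs = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            ref = {"kind": kind, "text": match.group(0)}
            if kind == "external_regulation":
                # Regulations that are not on hand get marked for follow-up.
                name = match.group(1)
                ref["status"] = "available" if name in available_regulations else "pending"
            refs.append(ref)
    return refs
```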
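
### Layer 5: Split Compound Rules

Identify and split the following patterns:

- Parallel conditions: "A and B" → split into rule A and rule B
- Conditional branches: "if X then A, if Y then B" → split into rule A (precondition X) and rule B (precondition Y)
- Tiered conditions: "amounts ≤ 100,000 follow process A, amounts > 100,000 follow process B" → split into rule A (threshold condition) and rule B (threshold condition)

Cases that do not need splitting:
- The rule is itself a single conditional check: "The invoice date must fall within the contract validity period" is already an atomic rule
- The rule covers several fields with one uniform logic: "The payee name and account number must match the contract" can stay as one rule, because the verification logic is the same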
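
The tiered-condition split can be illustrated with the 100,000 threshold from the list above. In this sketch, `ScopedRule` and the IDs `R-A` and `R-B` are hypothetical. The point is that the threshold moves out of the rule body and into each rule's precondition, so exactly one of the two rules applies to any given amount.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ScopedRule:
    rule_id: str
    precondition: Callable[[float], bool]  # does the rule apply to this amount?
    requirement: str

def split_tiered(threshold, low_requirement, high_requirement):
    """Split one tiered clause into two rules with complementary preconditions."""
    return [
        ScopedRule("R-A", lambda amount: amount <= threshold, low_requirement),
        ScopedRule("R-B", lambda amount: amount > threshold, high_requirement),
    ]

def applicable_rules(rules, amount):
    """IDs of the rules whose precondition holds for this amount."""
    return [rule.rule_id for rule in rules if rule.precondition(amount)]
```

Because the preconditions are complementary, the two rules never judge the same entity, which also satisfies the ambiguity-minimization principle.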
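
## Strategy 3: Expert Experience Dictated by the Developer User

Sometimes the developer user gives you no regulation file and instead tells you their business experience and verification checkpoints directly. This input is just as valid.

### How to Handle It

1. Record the developer user's account in full
2. Convert it into structured rules (ID, name, verification logic, decision criteria)
3. Read the rules back to the developer user for confirmation, paying particular attention to:
   - Whether any implicit preconditions were missed
   - Whether thresholds and criteria are accurate
   - Whether any exceptions went unmentioned
4. Label the rule source as "expert experience" (「专家经验」) rather than a regulation article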

## Rule Catalog Management

All extracted rules are collected in a lightweight catalog, stored as `rule-catalog.json` in the workspace root:

```json
{
  "rules": [
    {
      "id": "R001",
      "name": "Invoice date validity",
      "source": "Measures for the Administration of VAT Invoices (《增值税发票管理办法》), Article 15",
      "status": "extracted",
      "priority": "high",
      "skill_folder": "rule-skills/R001-invoice-date-validity/",
      "notes": ""
    }
  ],
  "total": 1,
  "extracted": 1,
  "skill_authored": 0,
  "workflow_distilled": 0,
  "last_updated": "<timestamp>"
}
```

### Status Transitions

The lifecycle states of each rule:

```
extracted → skill_authored → skill_tested → workflow_distilled → workflow_tested → production
```

The catalog tracks, in real time, which stage each rule is in.
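
Keeping the catalog in sync means every status change also refreshes the summary counters. The sketch below advances one rule along the lifecycle and recomputes the counters. It assumes each counter holds the number of rules currently in that status, which is one plausible reading of the example catalog. It leaves `last_updated` untouched because the timestamp format is not specified.

```python
from collections import Counter

LIFECYCLE = ["extracted", "skill_authored", "skill_tested",
             "workflow_distilled", "workflow_tested", "production"]

def advance(catalog, rule_id):
    """Move one rule to the next lifecycle state and refresh the summary counters."""
    rule = next((r for r in catalog["rules"] if r["id"] == rule_id), None)
    if rule is None:
        raise KeyError(rule_id)
    position = LIFECYCLE.index(rule["status"])  # ValueError for off-lifecycle states
    if position == len(LIFECYCLE) - 1:
        raise ValueError(f"{rule_id} is already in production")
    rule["status"] = LIFECYCLE[position + 1]

    # Assumption: each counter is the number of rules currently in that status.
    counts = Counter(r["status"] for r in catalog["rules"])
    catalog["total"] = len(catalog["rules"])
    for key in ("extracted", "skill_authored", "workflow_distilled"):
        catalog[key] = counts[key]
    return catalog
```

Stepping only one state at a time prevents a rule from skipping a testing stage by accident.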
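
## Handling Vagueness and Ambiguity

Regulatory text often contains vague wording such as "within a reasonable period" (「合理期限内」), "when necessary" (「必要时」), or "as the situation requires" (「视情况而定」). Principles:

1. **Extract first, do not skip**: vague does not mean unimportant
2. **Mark the ambiguity in the rule**: state exactly which part is open to interpretation
3. **Confirm with the developer user**: present your understanding and ask them to decide
4. **Update the rule after confirmation**: write the developer user's decision into the rule description

Never decide the meaning of a vague clause on your own. You understand the business less well than the developer user does.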
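
Principle 2 can be supported by a simple scan that records each vague phrase as an open question. The phrase list below contains only the three examples named above and is far from complete. The record fields (`status`, `resolution`) are hypothetical. The `resolution` stays `None` until the developer user decides, so the code never picks an interpretation itself.

```python
VAGUE_PHRASES = ("合理期限内", "必要时", "视情况而定")

def flag_ambiguities(rule_id, clause):
    """List vague phrases in a clause as open questions for the developer user."""
    return [
        {
            "rule_id": rule_id,
            "phrase": phrase,
            "status": "awaiting_confirmation",
            "resolution": None,  # filled in only after the developer user decides
        }
        for phrase in VAGUE_PHRASES
        if phrase in clause
    ]
```

The returned items can feed the list of vague and ambiguous items that this stage delivers.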
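
## Handling Regulation Changes

When a regulation file in `Rules/` is added or modified:

1. Compare the old and new versions and identify the changes
2. Locate the existing rules that are affected
3. Assess the impact:
   - Wording changed but verification logic unchanged: update the quoted text; no retest needed
   - Threshold or criterion changed: update the rule parameters; retest required
   - New verification requirement: extract a new rule
   - Existing requirement repealed: mark the rule `deprecated`; do not delete it
4. Update the rule catalog
5. Notify the developer user of the scope of impact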
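
The four impact categories in step 3 can be expressed as a small dispatch. In this sketch the `Change` enum and `apply_change` function are hypothetical, and generating the next ID from the catalog length is an assumption that only holds because rules are never deleted. The function returns whether the affected rule needs retesting.

```python
from enum import Enum

class Change(Enum):
    WORDING_ONLY = "wording_only"
    THRESHOLD_CHANGED = "threshold_changed"
    NEW_REQUIREMENT = "new_requirement"
    REPEALED = "repealed"

def apply_change(catalog, change, rule_id=None):
    """Apply one classified change; return True when the rule needs retesting."""
    if change is Change.NEW_REQUIREMENT:
        # A new rule enters the lifecycle at "extracted"; there is nothing to retest yet.
        new_id = f"R{len(catalog['rules']) + 1:03d}"
        catalog["rules"].append({"id": new_id, "status": "extracted"})
        return False
    rule = next(r for r in catalog["rules"] if r["id"] == rule_id)
    if change is Change.REPEALED:
        rule["status"] = "deprecated"  # kept in the catalog, never deleted
        return False
    # Wording-only changes keep their test results; threshold changes do not.
    return change is Change.THRESHOLD_CHANGED
```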
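
## Deliverables

When the rule extraction stage is complete, it should produce:

1. `rule-catalog.json`: the master rule catalog
2. An initial description document for each rule (stored as a draft in the corresponding skill folder)
3. A list of vague and ambiguous items (awaiting confirmation by the developer user)
4. A cross-reference map (references between rules, and between rules and regulations)
5. A summary report of the extraction results for the developer user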