kc-beta 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/kc-beta.js +16 -0
- package/package.json +32 -0
- package/src/agent/confidence-scorer.js +120 -0
- package/src/agent/context.js +124 -0
- package/src/agent/corner-case-registry.js +119 -0
- package/src/agent/engine.js +224 -0
- package/src/agent/events.js +27 -0
- package/src/agent/history.js +101 -0
- package/src/agent/llm-client.js +131 -0
- package/src/agent/pipelines/base.js +14 -0
- package/src/agent/pipelines/distillation.js +113 -0
- package/src/agent/pipelines/extraction.js +92 -0
- package/src/agent/pipelines/index.js +23 -0
- package/src/agent/pipelines/initializer.js +163 -0
- package/src/agent/pipelines/production-qc.js +99 -0
- package/src/agent/pipelines/skill-authoring.js +83 -0
- package/src/agent/pipelines/skill-testing.js +111 -0
- package/src/agent/tools/agent-tool.js +100 -0
- package/src/agent/tools/base.js +35 -0
- package/src/agent/tools/dashboard-render.js +146 -0
- package/src/agent/tools/document-parse.js +184 -0
- package/src/agent/tools/document-search.js +111 -0
- package/src/agent/tools/evolution-cycle.js +150 -0
- package/src/agent/tools/qc-sample.js +94 -0
- package/src/agent/tools/registry.js +55 -0
- package/src/agent/tools/rule-catalog.js +113 -0
- package/src/agent/tools/sandbox-exec.js +106 -0
- package/src/agent/tools/tier-downgrade.js +114 -0
- package/src/agent/tools/worker-llm-call.js +109 -0
- package/src/agent/tools/workflow-run.js +138 -0
- package/src/agent/tools/workspace-file.js +122 -0
- package/src/agent/version-manager.js +130 -0
- package/src/agent/workspace.js +82 -0
- package/src/cli/components.js +164 -0
- package/src/cli/index.js +329 -0
- package/src/cli/init.js +80 -0
- package/src/cli/onboard.js +182 -0
- package/src/cli/terminal.js +143 -0
- package/src/config.js +93 -0
- package/template/.env.template +31 -0
- package/template/CLAUDE.md +137 -0
- package/template/Input/.gitkeep +0 -0
- package/template/Output/.gitkeep +0 -0
- package/template/Rules/.gitkeep +0 -0
- package/template/Samples/.gitkeep +0 -0
- package/template/skills/en/meta/compliance-judgment/SKILL.md +114 -0
- package/template/skills/en/meta/compliance-judgment/references/output-format.md +151 -0
- package/template/skills/en/meta/confidence-system/SKILL.md +117 -0
- package/template/skills/en/meta/corner-case-management/SKILL.md +111 -0
- package/template/skills/en/meta/cross-document-verification/SKILL.md +131 -0
- package/template/skills/en/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
- package/template/skills/en/meta/data-sensibility/SKILL.md +115 -0
- package/template/skills/en/meta/document-parsing/SKILL.md +108 -0
- package/template/skills/en/meta/document-parsing/references/parser-catalog.md +40 -0
- package/template/skills/en/meta/entity-extraction/SKILL.md +129 -0
- package/template/skills/en/meta/tree-processing/SKILL.md +103 -0
- package/template/skills/en/meta-meta/bootstrap-workspace/SKILL.md +70 -0
- package/template/skills/en/meta-meta/dashboard-reporting/SKILL.md +106 -0
- package/template/skills/en/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
- package/template/skills/en/meta-meta/evolution-loop/SKILL.md +210 -0
- package/template/skills/en/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
- package/template/skills/en/meta-meta/quality-control/SKILL.md +138 -0
- package/template/skills/en/meta-meta/quality-control/references/qa-layers.md +92 -0
- package/template/skills/en/meta-meta/quality-control/references/sampling-strategies.md +76 -0
- package/template/skills/en/meta-meta/rule-extraction/SKILL.md +100 -0
- package/template/skills/en/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
- package/template/skills/en/meta-meta/rule-graph/SKILL.md +118 -0
- package/template/skills/en/meta-meta/skill-authoring/SKILL.md +108 -0
- package/template/skills/en/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
- package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +150 -0
- package/template/skills/en/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
- package/template/skills/en/meta-meta/task-decomposition/SKILL.md +129 -0
- package/template/skills/en/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
- package/template/skills/en/meta-meta/version-control/SKILL.md +152 -0
- package/template/skills/en/meta-meta/version-control/references/trace-id-spec.md +79 -0
- package/template/skills/en/skill-creator/LICENSE.txt +202 -0
- package/template/skills/en/skill-creator/SKILL.md +479 -0
- package/template/skills/en/skill-creator/agents/analyzer.md +274 -0
- package/template/skills/en/skill-creator/agents/comparator.md +202 -0
- package/template/skills/en/skill-creator/agents/grader.md +223 -0
- package/template/skills/en/skill-creator/assets/eval_review.html +146 -0
- package/template/skills/en/skill-creator/eval-viewer/generate_review.py +471 -0
- package/template/skills/en/skill-creator/eval-viewer/viewer.html +1325 -0
- package/template/skills/en/skill-creator/references/schemas.md +430 -0
- package/template/skills/en/skill-creator/scripts/__init__.py +0 -0
- package/template/skills/en/skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/template/skills/en/skill-creator/scripts/generate_report.py +326 -0
- package/template/skills/en/skill-creator/scripts/improve_description.py +248 -0
- package/template/skills/en/skill-creator/scripts/package_skill.py +136 -0
- package/template/skills/en/skill-creator/scripts/quick_validate.py +103 -0
- package/template/skills/en/skill-creator/scripts/run_eval.py +310 -0
- package/template/skills/en/skill-creator/scripts/run_loop.py +332 -0
- package/template/skills/en/skill-creator/scripts/utils.py +47 -0
- package/template/skills/zh/meta/compliance-judgment/SKILL.md +303 -0
- package/template/skills/zh/meta/compliance-judgment/references/output-format.md +151 -0
- package/template/skills/zh/meta/confidence-system/SKILL.md +228 -0
- package/template/skills/zh/meta/corner-case-management/SKILL.md +235 -0
- package/template/skills/zh/meta/cross-document-verification/SKILL.md +241 -0
- package/template/skills/zh/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
- package/template/skills/zh/meta/data-sensibility/SKILL.md +235 -0
- package/template/skills/zh/meta/document-parsing/SKILL.md +168 -0
- package/template/skills/zh/meta/document-parsing/references/parser-catalog.md +40 -0
- package/template/skills/zh/meta/entity-extraction/SKILL.md +276 -0
- package/template/skills/zh/meta/tree-processing/SKILL.md +233 -0
- package/template/skills/zh/meta-meta/bootstrap-workspace/SKILL.md +147 -0
- package/template/skills/zh/meta-meta/dashboard-reporting/SKILL.md +281 -0
- package/template/skills/zh/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
- package/template/skills/zh/meta-meta/evolution-loop/SKILL.md +302 -0
- package/template/skills/zh/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
- package/template/skills/zh/meta-meta/quality-control/SKILL.md +269 -0
- package/template/skills/zh/meta-meta/quality-control/references/qa-layers.md +92 -0
- package/template/skills/zh/meta-meta/quality-control/references/sampling-strategies.md +76 -0
- package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +208 -0
- package/template/skills/zh/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
- package/template/skills/zh/meta-meta/rule-graph/SKILL.md +203 -0
- package/template/skills/zh/meta-meta/skill-authoring/SKILL.md +235 -0
- package/template/skills/zh/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
- package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +275 -0
- package/template/skills/zh/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
- package/template/skills/zh/meta-meta/task-decomposition/SKILL.md +224 -0
- package/template/skills/zh/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
- package/template/skills/zh/meta-meta/version-control/SKILL.md +284 -0
- package/template/skills/zh/meta-meta/version-control/references/trace-id-spec.md +79 -0
- package/template/skills/zh/skill-creator/LICENSE.txt +202 -0
- package/template/skills/zh/skill-creator/SKILL.md +479 -0
- package/template/skills/zh/skill-creator/agents/analyzer.md +274 -0
- package/template/skills/zh/skill-creator/agents/comparator.md +202 -0
- package/template/skills/zh/skill-creator/agents/grader.md +223 -0
- package/template/skills/zh/skill-creator/assets/eval_review.html +146 -0
- package/template/skills/zh/skill-creator/eval-viewer/generate_review.py +471 -0
- package/template/skills/zh/skill-creator/eval-viewer/viewer.html +1325 -0
- package/template/skills/zh/skill-creator/references/schemas.md +430 -0
- package/template/skills/zh/skill-creator/scripts/__init__.py +0 -0
- package/template/skills/zh/skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/template/skills/zh/skill-creator/scripts/generate_report.py +326 -0
- package/template/skills/zh/skill-creator/scripts/improve_description.py +248 -0
- package/template/skills/zh/skill-creator/scripts/package_skill.py +136 -0
- package/template/skills/zh/skill-creator/scripts/quick_validate.py +103 -0
- package/template/skills/zh/skill-creator/scripts/run_eval.py +310 -0
- package/template/skills/zh/skill-creator/scripts/run_loop.py +332 -0
- package/template/skills/zh/skill-creator/scripts/utils.py +47 -0
@@ -0,0 +1,178 @@
"""
Dashboard Generator: Starter Script

Generates a self-contained HTML dashboard from verification results.
This is a starting point. The coding agent should customize it for the
specific business scenario.

Usage:
    python generate_dashboard.py --input <output_dir> --output <dashboard.html>

Input: A directory containing verification result JSON files.
Output: A single self-contained HTML file.
"""

import argparse
import json
import os
from datetime import datetime
from pathlib import Path


def load_results(input_dir: str) -> list[dict]:
    """Load all JSON result files from the input directory."""
    results = []
    for f in Path(input_dir).glob("**/*.json"):
        # Skip any path that mentions "dashboard" or "versions".
        if "dashboard" in str(f) or "versions" in str(f):
            continue
        try:
            # Read as UTF-8 explicitly; the platform default encoding varies.
            with open(f, encoding="utf-8") as fh:
                data = json.load(fh)
                if isinstance(data, list):
                    results.extend(data)
                elif isinstance(data, dict) and "results" in data:
                    results.extend(data["results"])
                elif isinstance(data, dict) and "rule_id" in data:
                    results.append(data)
        except (json.JSONDecodeError, UnicodeDecodeError, KeyError):
            # Ignore files that cannot be read as result JSON.
            continue
    return results


def compute_summary(results: list[dict]) -> dict:
    """Compute summary statistics from results."""
    total = len(results)
    if total == 0:
        return {"total": 0, "pass": 0, "fail": 0, "missing": 0, "error": 0}

    summary = {
        "total": total,
        "pass": sum(1 for r in results if r.get("result") == "pass"),
        "fail": sum(1 for r in results if r.get("result") == "fail"),
        "missing": sum(1 for r in results if r.get("result") == "missing"),
        "error": sum(1 for r in results if r.get("result") == "error"),
    }
    summary["pass_rate"] = round(summary["pass"] / total * 100, 1)
    return summary


def compute_per_rule(results: list[dict]) -> dict:
    """Compute per-rule statistics."""
    rules = {}
    for r in results:
        rule_id = r.get("rule_id", "unknown")
        if rule_id not in rules:
            rules[rule_id] = {"total": 0, "pass": 0, "fail": 0, "results": []}
        rules[rule_id]["total"] += 1
        if r.get("result") == "pass":
            rules[rule_id]["pass"] += 1
        elif r.get("result") == "fail":
            rules[rule_id]["fail"] += 1
        rules[rule_id]["results"].append(r)

    for rule_id, data in rules.items():
        data["accuracy"] = round(data["pass"] / data["total"] * 100, 1) if data["total"] > 0 else 0

    return rules


def generate_html(summary: dict, per_rule: dict, failed_cases: list[dict]) -> str:
    """Generate a self-contained HTML dashboard."""
    # Imported locally because the name `html` is reused for the output string below.
    from html import escape

    rule_rows = ""
    for rule_id, data in sorted(per_rule.items()):
        color = "#4caf50" if data["accuracy"] >= 90 else "#ff9800" if data["accuracy"] >= 70 else "#f44336"
        rule_rows += f"""
        <tr>
            <td>{escape(str(rule_id))}</td>
            <td>{data['total']}</td>
            <td>{data['pass']}</td>
            <td>{data['fail']}</td>
            <td style="color: {color}; font-weight: bold;">{data['accuracy']}%</td>
        </tr>"""

    fail_rows = ""
    for case in failed_cases[:100]:  # Limit to first 100
        # Escape free-text fields so characters such as "<" or "&" cannot break the markup.
        fail_rows += f"""
        <tr>
            <td>{escape(str(case.get('document', 'N/A')))}</td>
            <td>{escape(str(case.get('rule_id', 'N/A')))}</td>
            <td>{escape(str(case.get('extracted_value', 'N/A')))}</td>
            <td>{escape(str(case.get('comment', 'N/A')))}</td>
            <td>{escape(str(case.get('confidence', 'N/A')))}</td>
        </tr>"""

    html = f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>KC Reborn — Verification Dashboard</title>
<style>
    :root {{ --bg: #1a1a2e; --surface: #16213e; --text: #e0e0e0; --accent: #4caf50; --warn: #ff9800; --err: #f44336; }}
    @media (prefers-color-scheme: light) {{
        :root {{ --bg: #f5f5f5; --surface: #ffffff; --text: #333333; }}
    }}
    * {{ margin: 0; padding: 0; box-sizing: border-box; }}
    body {{ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; background: var(--bg); color: var(--text); padding: 20px; }}
    h1 {{ margin-bottom: 20px; }}
    .summary {{ display: flex; gap: 16px; margin-bottom: 24px; flex-wrap: wrap; }}
    .card {{ background: var(--surface); padding: 20px; border-radius: 8px; min-width: 140px; text-align: center; }}
    .card .number {{ font-size: 2em; font-weight: bold; }}
    .card .label {{ font-size: 0.85em; opacity: 0.7; margin-top: 4px; }}
    table {{ width: 100%; border-collapse: collapse; background: var(--surface); border-radius: 8px; overflow: hidden; margin-bottom: 24px; }}
    th, td {{ padding: 10px 14px; text-align: left; border-bottom: 1px solid rgba(255,255,255,0.1); }}
    th {{ background: rgba(0,0,0,0.2); font-weight: 600; }}
    h2 {{ margin: 20px 0 12px; }}
    .timestamp {{ opacity: 0.5; font-size: 0.85em; margin-bottom: 20px; }}
</style>
</head>
<body>
<h1>Verification Dashboard</h1>
<div class="timestamp">Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</div>

<div class="summary">
    <div class="card"><div class="number">{summary['total']}</div><div class="label">Total</div></div>
    <div class="card"><div class="number" style="color: var(--accent);">{summary['pass']}</div><div class="label">Pass</div></div>
    <div class="card"><div class="number" style="color: var(--err);">{summary['fail']}</div><div class="label">Fail</div></div>
    <div class="card"><div class="number" style="color: var(--warn);">{summary.get('missing', 0)}</div><div class="label">Missing</div></div>
    <div class="card"><div class="number">{summary.get('pass_rate', 0)}%</div><div class="label">Pass Rate</div></div>
</div>

<h2>Per-Rule Breakdown</h2>
<table>
    <thead><tr><th>Rule</th><th>Total</th><th>Pass</th><th>Fail</th><th>Accuracy</th></tr></thead>
    <tbody>{rule_rows}</tbody>
</table>

<h2>Failed Cases</h2>
<table>
    <thead><tr><th>Document</th><th>Rule</th><th>Extracted Value</th><th>Comment</th><th>Confidence</th></tr></thead>
    <tbody>{fail_rows if fail_rows else '<tr><td colspan="5" style="text-align:center; opacity:0.5;">No failures</td></tr>'}</tbody>
</table>
</body>
</html>"""
    return html


def main():
    parser = argparse.ArgumentParser(description="Generate verification dashboard")
    parser.add_argument("--input", required=True, help="Directory containing result JSON files")
    parser.add_argument("--output", default="dashboard.html", help="Output HTML file path")
    args = parser.parse_args()

    results = load_results(args.input)
    summary = compute_summary(results)
    per_rule = compute_per_rule(results)
    failed_cases = [r for r in results if r.get("result") == "fail"]

    html = generate_html(summary, per_rule, failed_cases)

    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(html, encoding="utf-8")
    print(f"Dashboard generated: {output_path}")


if __name__ == "__main__":
    main()
@@ -0,0 +1,302 @@
---
name: evolution-loop
description: Drive continuous improvement of skills and workflows through the diagnose-classify-fix-retest cycle. Use after any testing round reveals failures, when production quality control flags issues, or when accuracy drops below thresholds. Covers failure analysis, distinguishing systemic issues from corner cases, deciding whether to rewrite or patch, and knowing when to stop iterating. The evolution loop is the heartbeat of the system. Also use when transitioning between lifecycle phases (skill testing, workflow testing, production monitoring).
---

# Evolution Loop: Continuous Improvement Through Diagnose, Classify, Fix, and Regression-Test

## The Evolution Loop Is the Heartbeat of the System

Without an evolution loop, skills and workflows are static artifacts: written once and abandoned, with no recourse when problems appear. The evolution loop gives the system the ability to correct itself. It finds the problem, locates the cause, classifies it, fixes it, verifies the fix, and repeats until the system reaches a steady state.

The system's accuracy is not achieved by getting everything right on the first attempt. It is earned one iteration at a time.

## The Complete Steps of the Loop

```
Test → Observe → Reflect → Diagnose → Classify → Fix → Regression test → Record
  ↑                                                                        ↓
  ←←←←←←←←←←←←←←←←← continue looping if the target is not met ←←←←←←←←←←←←←←←←←
```

### Exit Conditions (stop when any one is met)

1. Accuracy reaches the threshold set in `.env` (`SKILL_ACCURACY` or `WORKFLOW_ACCURACY`)
2. The iteration count reaches the `MAX_ITERATIONS` limit
3. All remaining failures are irreducible corner cases (recorded and confirmed with the developer user)
4. The convergence conditions are met (see the "Convergence Tracking" section below)

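The exit conditions above can be sketched as a single predicate. This is a minimal illustration, not the package's implementation: the function name `should_stop`, its parameters, and the returned reason strings are hypothetical. Only the setting names `SKILL_ACCURACY`, `WORKFLOW_ACCURACY`, and `MAX_ITERATIONS` come from the text, and loading them from `.env` is left out.

```python
def should_stop(accuracy, threshold, iteration, max_iterations,
                only_confirmed_corner_cases=False, converged=False):
    """Return the first exit reason that applies, or None to keep looping.

    `threshold` stands in for SKILL_ACCURACY or WORKFLOW_ACCURACY and
    `max_iterations` for MAX_ITERATIONS (both assumed to be loaded elsewhere).
    """
    if accuracy >= threshold:
        return "accuracy_threshold_reached"
    if iteration >= max_iterations:
        return "max_iterations_reached"
    if only_confirmed_corner_cases:
        return "only_confirmed_corner_cases_remain"
    if converged:
        return "converged"
    return None


# Iteration 3 at 84% accuracy with a 90% threshold: keep looping.
print(should_stop(0.84, 0.90, iteration=3, max_iterations=10))
```

Returning a reason instead of a bare boolean makes it easy to write the exit cause into the iteration log.
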
## Step 1: Test

Run the skill or workflow and verify every sample in the test set.

### Test Set Composition

- `assets/samples.json`: the initial test samples
- `assets/corner_cases.json`: corner cases accumulated over previous iterations
- For workflow testing, the results must also be compared against the skill's results

### Test Execution Rules

- Every round must run the full test set, not just the newly added cases
- Record the verdict, confidence, and execution time for each case
- Save the test results to the `logs/evolution/` directory

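## Step 2: Observe

Do not rush to make changes as soon as the test results arrive. Review the results as a whole first and look for patterns.

### What to Look For

- What is the pass rate? Did it rise or fall compared with the previous round?
- Which document types do the failures cluster on?
- Are there batches of errors of the same kind?
- Are any wrong verdicts hiding among the high-confidence cases?
- Did the previous round's fixes introduce new regressions?

### User Feedback as Ground Truth

When developer users or end users report errors in verification results, treat those corrections as ground truth. User-reported corrections take precedence over the coding agent's own quality judgment. Within the evolution loop, diagnose user-reported errors first and only then handle issues the agent detected itself, because user feedback represents confirmed failures, not suspected ones.

Collect user error reports through the feedback mechanism built into the dashboard (see `dashboard-reporting`). Every report should be converted into a test case and added to the regression test set.

### Recording Observations

```json
{
  "iteration": 3,
  "total_cases": 25,
  "pass": 21,
  "fail": 3,
  "unable_to_verify": 1,
  "accuracy": 0.84,
  "delta_from_last": 0.04,
  "observation": "All three failures involve order invoices under a framework contract; the scope definition is likely insufficient"
}
```
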
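## Step 3: Reflect

Before diagnosing new problems, review what has already been tried. This step is what keeps the loop from cycling through the same failed fixes.

Read the structured iteration logs in `logs/evolution/{rule_id}/`, paying attention to:
- Which failure types were identified in previous iterations
- Which fixes were attempted and how they performed: did accuracy rise, fall, or stay flat?
- Whether any fix was rolled back because it introduced a regression

**Anti-pattern detection**: If the current failure matches a pattern that was already diagnosed and "fixed" in the past, the earlier fix was insufficient. Escalate the fix strategy, for example from prompt tuning to a logic rewrite, or from a logic rewrite to an architectural change. Do not try the same kind of fix twice on the same kind of problem.

**Output**: Produce a brief summary of the iteration history for the diagnosis step to use. The summary should state which approaches have been tried and how well they worked. Skip this step in the first iteration, when there is no history.

### Three Dimensions of Reflection

When reviewing failures, analyze them along three intersecting dimensions:

1. **By rule (one rule across different documents)**: Does a rule (e.g. R003) fail frequently on a specific subset of documents? This points to a rule-level defect, such as an extraction prompt that cannot handle a particular document format.

2. **By document (one document across different rules)**: Does a class of documents cause several rules to fail at once? This points to a document-level problem, such as a template whose parsing is broken and affects all downstream extraction.

3. **Globally (are there systemic correlations?)** For example:
   - Failures cluster on documents processed by a specific model tier.
   - Failures cluster on documents processed by a specific parser tier.
   - Failures appear only when document length exceeds some threshold.

Cross-referencing these three dimensions often reveals root causes that are invisible from any single view. For example, "R003 fails on type X documents" is a by-rule finding, but if type X documents also cause R007 and R012 to fail, the real problem lies in the document type, not in any single rule.
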
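The by-rule and by-document views can be built from the same list of failures and then cross-referenced. The sketch below assumes each failure is a dict with `rule_id` and `document` keys (the field names used by the dashboard script); the helper names `cross_tabulate` and `documents_failing_many_rules` are hypothetical.

```python
from collections import defaultdict


def cross_tabulate(failures):
    """Index failures by rule and by document."""
    by_rule = defaultdict(set)      # rule_id -> documents it failed on
    by_document = defaultdict(set)  # document -> rules that failed on it
    for f in failures:
        by_rule[f["rule_id"]].add(f["document"])
        by_document[f["document"]].add(f["rule_id"])
    return by_rule, by_document


def documents_failing_many_rules(by_document, min_rules=2):
    """Documents on which at least `min_rules` distinct rules failed."""
    return sorted(doc for doc, rules in by_document.items() if len(rules) >= min_rules)


failures = [
    {"rule_id": "R003", "document": "type_x_01.pdf"},
    {"rule_id": "R007", "document": "type_x_01.pdf"},
    {"rule_id": "R012", "document": "type_x_01.pdf"},
    {"rule_id": "R003", "document": "invoice_17.pdf"},
]
by_rule, by_document = cross_tabulate(failures)
print(documents_failing_many_rules(by_document))  # candidates for a document-level root cause
```

A document that appears in this list is a hint that the parser or template, not any individual rule, deserves the first look.
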
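## Step 4: Diagnose

Analyze the root cause of each failure one by one. Failures usually fall into one of four categories:

### Parsing Failure

The system failed to correctly identify or extract the target field from the document.

Symptoms: the extracted field value does not match the actual value, or the extraction result is empty.
Common causes: an uncovered document format variant, poor OCR quality, a shifted field position.

### Extraction Failure

The field was identified correctly, but the extracted value is incomplete or wrong.

Symptoms: a date missing its year, an amount missing its decimal places, a truncated name.
Common causes: extraction rules in the prompt are not strict enough, or the field format falls outside what was expected.

### Judgment Failure

The field was extracted correctly, but the verification conclusion is wrong.

Symptoms: a case that should clearly pass is judged as failing, or the reverse.
Common causes: flawed judgment logic, mishandled boundary conditions, uncovered exceptions.

### Scope Failure

The rule was applied to a scenario it does not cover, or was not applied to a scenario it should cover.

Symptoms: a verification conclusion is given for a document type the rule does not apply to, or a document that should be checked is skipped.
Common causes: the rule's scope of application is described imprecisely, or preconditions are missing.
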
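## Step 5: Classify

Failures are split into two classes by their scope of impact, and each class gets a different fix strategy.

### Systemic Issues (affecting >10% of test cases)

One root cause makes multiple cases fail. This means the verification logic itself is flawed and needs to be rewritten or substantially revised.

Criterion: the same kind of error appears in more than 10% of test cases.

Fix strategy: **Rewrite**. Go back to the skill's core logic: revise the judgment rules, add the missing conditions, restructure the extraction flow.

### Corner Cases (affecting ≤10% of test cases)

Failures caused by individual special situations. The verification logic is correct overall but does not cover certain special circumstances.

Criterion: only isolated cases are affected, with no common pattern.

Fix strategy: **Record**. Add the corner case to `corner_cases.json` and document it in the corner-case section of SKILL.md. If the corner case can be handled with a simple conditional branch, a small logic change is also acceptable.
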
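The 10% rule above can be applied mechanically once each failure carries a root-cause label. The sketch below is an illustration only: the `root_cause` field and the function name `classify_by_root_cause` are assumptions, and the "no common pattern" part of the corner-case criterion still needs human judgment.

```python
from collections import Counter


def classify_by_root_cause(failures, total_cases, systemic_share=0.10):
    """Label each root cause "systemic" or "corner_case" by its share of all test cases."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    counts = Counter(f["root_cause"] for f in failures)
    return {
        cause: "systemic" if n / total_cases > systemic_share else "corner_case"
        for cause, n in counts.items()
    }


failures = [
    {"case_id": "S003", "root_cause": "scope_undefined"},
    {"case_id": "S008", "root_cause": "scope_undefined"},
    {"case_id": "S019", "root_cause": "scope_undefined"},
    {"case_id": "S012", "root_cause": "ocr_noise"},
]
# 3 of 25 cases (12%) share one cause; 1 of 25 (4%) stands alone.
print(classify_by_root_cause(failures, total_cases=25))
```
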
## Step 6: Fix

### Fixing Systemic Issues

1. Pin down the root cause and the scope of impact
2. Design the fix (this may involve rewriting the judgment logic in SKILL.md or changing the code under scripts/)
3. Create a new version before making changes (see the version-control skill)
4. Apply the changes
5. Spot-check the failing cases first to confirm the fix works

### Handling Corner Cases

1. Record the case in detail in `corner_cases.json`
2. Add a corner-case note to SKILL.md
3. If code-level handling is needed, add a conditional branch
4. If the case is a business gray area, mark it "pending developer user confirmation"

## Step 7: Regression Test

After a fix, the full test set must be run, not just the fixed cases.

### Regression Checklist

- Do the fixed cases actually pass now?
- Do the previously passing cases still pass? (no regressions)
- Did overall accuracy improve?
- If a regression appears, analyze the cause immediately: a fix must not come at the expense of other cases

### Regressions Are Unacceptable

If a fix causes previously passing cases to fail, the fix is flawed. Roll back the change and redesign the fix.

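The regression check above amounts to comparing two rounds of verdicts case by case. A minimal sketch, assuming each round is available as a mapping from case ID to verdict string; the function name `find_regressions` and the sample IDs are hypothetical.

```python
def find_regressions(previous, current):
    """Return IDs of cases that passed in the previous round but do not pass now.

    A case missing from `current` is also reported, since the full test set
    is supposed to be rerun every round.
    """
    return sorted(
        case_id
        for case_id, verdict in previous.items()
        if verdict == "pass" and current.get(case_id) != "pass"
    )


previous = {"S001": "pass", "S002": "fail", "S003": "pass"}
current = {"S001": "pass", "S002": "pass", "S003": "fail"}
regressed = find_regressions(previous, current)
if regressed:
    print("Roll back and redesign the fix; regressed cases:", regressed)
```
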
## Step 8: Record

Every iteration must be fully recorded, with both a structured log in JSON and a human-readable text log. The structured log is the data source for the reflection step (Step 3), so make sure it can be scanned programmatically.

### Structured Log

Stored at `logs/evolution/{rule_id}/iteration_{N}.json`:

```json
{
  "rule_id": "R001",
  "iteration": 3,
  "phase": "skill_testing",
  "timestamp": "2025-04-01T14:30:00Z",
  "reflection": "Round 2 tried the same regex fix and accuracy did not change",
  "test_results": {
    "total": 25,
    "pass": 23,
    "fail": 1,
    "unable_to_verify": 1,
    "accuracy": 0.92
  },
  "failures": [
    {
      "case_id": "S012",
      "diagnosis": "judgment_failure",
      "classification": "corner_case",
      "description": "Invoice issued after a framework contract extension; the extension terms are in a supplementary agreement",
      "action": "recorded_as_corner_case"
    }
  ],
  "changes_made": [
    "Added judgment logic for the framework contract extension scenario",
    "Added the supplementary agreement scenario to corner_cases.json"
  ],
  "regression_check": "No regressions",
  "decision": "Accuracy 92% ≥ threshold 90%; proceed to the distillation stage"
}
```

### Text Log

Stored at `logs/evolution/{rule_id}/iteration_{N}.md`. It describes the iteration's key findings and decisions in natural language so that the developer user can read them easily.

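## Convergence Tracking

Track three metrics in every iteration to judge when to stop:

- **Correction volume**: how many test cases changed result compared with the previous round (as a percentage of the total).
- **New patterns**: how many previously unseen failure patterns were identified in this round.
- **Regressions**: how many test cases that passed in the previous round fail in this one.

### Stop Conditions

Stop the loop when a single iteration meets all three of the following conditions:

1. Correction volume < 5% of total test cases.
2. New patterns = 0.
3. Regressions = 0.

If correction volume **increases** between two consecutive iterations, that is a regression signal. Pause the loop and diagnose the cause before continuing, because the previous round's fix may be destabilizing the system.

### Expected Convergence Speed

Convergence is not linear. Expect rapid improvement in the early iterations, followed by diminishing returns:

| Number of documents | Typical rounds to converge |
|---------------------|----------------------------|
| < 50 | 3–5 |
| 50–200 | 4–7 |
| 200–1000 | 5–10 |
| 1000+ | 7–15 |

These figures are empirical estimates. Actual convergence depends on rule complexity, document diversity, and model capability. If the number of rounds significantly exceeds expectations, re-examine the approach instead of grinding through more rounds.

See `references/convergence-guide.md` for the diagnostic procedures and real-world convergence data.
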
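The three tracked metrics and the stop conditions can be computed from two consecutive rounds of verdicts. This is a sketch under stated assumptions: verdicts are given as case-ID-to-string mappings, failure patterns are plain string labels assigned during diagnosis, and the names `convergence_metrics` and `has_converged` are hypothetical.

```python
def convergence_metrics(previous, current, seen_patterns, round_patterns):
    """Compute correction volume, new-pattern count, and regression count.

    Cases that are new in `current` count as changed, so adding documents
    to the test set inflates correction volume.
    """
    total = len(current)
    changed = sum(1 for cid, verdict in current.items() if previous.get(cid) != verdict)
    regressions = sum(
        1 for cid, verdict in current.items()
        if previous.get(cid) == "pass" and verdict == "fail"
    )
    return {
        "correction_volume": changed / total if total else 0.0,
        "new_patterns": len(set(round_patterns) - set(seen_patterns)),
        "regressions": regressions,
    }


def has_converged(metrics, max_correction_volume=0.05):
    """All three stop conditions must hold in the same iteration."""
    return (
        metrics["correction_volume"] < max_correction_volume
        and metrics["new_patterns"] == 0
        and metrics["regressions"] == 0
    )


# 40 cases; one previously failing case now passes (2.5% correction volume).
previous = {f"S{i:03d}": "pass" for i in range(40)}
previous["S000"] = "fail"
current = {cid: "pass" for cid in previous}
metrics = convergence_metrics(previous, current, {"date_format"}, {"date_format"})
print(metrics, has_converged(metrics))
```
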
## Three Operating Phases

The evolution loop runs in different modes at different stages of the system's lifecycle:

### Evolution Phase (Active Evolution)

- Trigger: a skill or workflow is under active development and iteration
- Frequency: run immediately after every change
- Test set: the full samples.json + corner_cases.json
- Goal: reach the accuracy threshold

### Monitoring Phase (Quality Control)

- Trigger: the workflow is deployed to production and quality monitoring has found a problem
- Frequency: determined by the quality-control skill
- Test set: problem cases found in production + the existing test set
- Goal: fix the new problems exposed in production

### Steady-State Phase (Stability)

- Trigger: N consecutive batches with no new problems
- Frequency: run only when quality monitoring raises an alert
- Characteristic: iteration frequency approaches zero
- Goal: keep accuracy from declining

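## When to Ask the Developer User for Help

Not every problem can be solved by iterating. Escalate promptly in the following situations:

- **Disagreement over regulatory interpretation**: your reading of a regulation may be wrong and needs confirmation from a business expert
- **Irreducible failures**: cases remain unfixable after `MAX_ITERATIONS` is reached
- **Contradictory rules**: two rules impose opposite requirements on the same scenario
- **Insufficient samples**: the test set does not cover enough scenarios to tell whether a fix works
- **Cost versus accuracy trade-off**: gaining the last 5% of accuracy would require a large increase in cost, which is a decision for the developer user

When escalating, provide a structured problem description, your analysis, and suggested options so that the developer user can decide efficiently.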
@@ -0,0 +1,62 @@
# Convergence Guide

Diagnostic procedures and real-world data for understanding when the evolution loop is converging, stalling, or regressing.

## Empirical Data

### The Shiji Project: Event Dating Reflection

A document verification project for historical event dating across regulatory filings. Five rounds of evolution:

- **Round 1**: 1,010 corrections (first pass: many extraction and judgment errors across the board).
- **Round 2**: 431 corrections (systematic fixes applied: regex patterns, prompt refinements).
- **Round 3**: 465 corrections (regression: the round 2 fix for date normalization introduced new failures on edge-case date formats).
- **Round 4**: 167 corrections (stabilizing: the round 3 regression was diagnosed and resolved; the remaining issues are corner cases).
- **Round 5**: 46 corrections (converged: below the 5% threshold, no new patterns, no regressions).

**Key insight**: The round 3 spike was the most informative event. It revealed that the round 2 fix was too aggressive: it normalized dates that should not have been normalized. Without convergence tracking, this regression might have been masked by overall accuracy still improving on other cases.

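Feeding the five correction counts above into a simple round-over-round comparison is enough to surface the round 3 spike. The helper name `rounds_with_increase` is hypothetical; the numbers are the ones listed for the Shiji project.

```python
def rounds_with_increase(corrections):
    """Return 1-based round numbers whose correction count exceeds the previous round's."""
    return [
        i + 1
        for i in range(1, len(corrections))
        if corrections[i] > corrections[i - 1]
    ]


shiji_corrections = [1010, 431, 465, 167, 46]  # rounds 1 through 5
print(rounds_with_increase(shiji_corrections))
```

Any round this check reports is a point where the loop should pause for diagnosis before another fix is attempted.
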
## Diagnostic Flowchart

### If correction volume increases between iterations:

1. **Check for regression**: Are previously passing cases now failing? If yes, the last fix is the likely cause. Compare the diff between iterations.
2. **Check for fix conflicts**: Does the new fix contradict a prior fix? For example, broadening a regex in round N that was narrowed in round N-1.
3. **Check for test set changes**: Did new documents enter the test set between iterations? New documents can inflate correction volume without indicating regression.

### If correction volume stays flat (not decreasing):

1. **Check for oscillation**: Are the same cases flipping between pass and fail across iterations? This indicates the fix is unstable: it solves one variant but breaks another.
2. **Check if fix is too narrow**: The fix addresses the specific failing cases but does not generalize. The next iteration reveals similar cases the fix missed.

## False Convergence

Metrics look stable but underlying issues are masked. The system appears converged but will fail on production data.

### Common Causes

- **Test set too small**: With fewer than 20 test cases, a single case changing swings the metrics by more than 5 percentage points. Convergence at this scale is statistically meaningless.
- **Test set does not cover production variety**: The test set was curated from "clean" examples. Production documents include scanned PDFs, handwritten annotations, multi-language content, and formatting variations the test set never saw.
- **Corner cases excluded from metrics**: If difficult cases are moved to `corner_cases.json` and excluded from the accuracy calculation, the remaining "easy" cases converge quickly but the real problem is hidden.

### Detection

Compare the test set distribution to the production distribution on key dimensions: document type, length, format, source. If they diverge significantly, convergence on the test set does not guarantee production quality.

## Estimating Remaining Rounds

### Simple Heuristic

If corrections approximately halve each round, expect `log2(current_corrections / threshold)` more rounds.

Example: current round has 200 corrections, threshold is 5% of 1000 cases = 50 corrections.
- Estimated remaining rounds: log2(200/50) = log2(4) = 2 rounds.

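The halving heuristic is a one-line calculation. The sketch below rounds up to whole rounds and returns zero once the count is already at or below the threshold; both choices, and the function name `estimate_remaining_rounds`, are assumptions added for illustration.

```python
import math


def estimate_remaining_rounds(current_corrections, threshold):
    """Estimate rounds left, assuming corrections roughly halve each round."""
    if current_corrections <= 0 or threshold <= 0:
        raise ValueError("both arguments must be positive")
    if current_corrections <= threshold:
        return 0
    return math.ceil(math.log2(current_corrections / threshold))


# 200 corrections now, threshold of 50 (5% of 1000 cases).
print(estimate_remaining_rounds(200, 50))
```

The estimate only holds while the halving assumption does, so it is best recomputed after every round.
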
### When the Heuristic Fails

If corrections do not halve between rounds, the current approach may have hit its ceiling. Consider:
- Escalating the fix strategy (prompt tweak → logic rewrite → architecture change).
- Expanding the test set to reveal hidden patterns.
- Consulting the developer user for domain insight on stubborn failures.

Do not grind through more iterations expecting different results. If three consecutive rounds show similar correction volumes, stop and reassess.