kc-beta 0.1.0
- package/bin/kc-beta.js +16 -0
- package/package.json +32 -0
- package/src/agent/confidence-scorer.js +120 -0
- package/src/agent/context.js +124 -0
- package/src/agent/corner-case-registry.js +119 -0
- package/src/agent/engine.js +224 -0
- package/src/agent/events.js +27 -0
- package/src/agent/history.js +101 -0
- package/src/agent/llm-client.js +131 -0
- package/src/agent/pipelines/base.js +14 -0
- package/src/agent/pipelines/distillation.js +113 -0
- package/src/agent/pipelines/extraction.js +92 -0
- package/src/agent/pipelines/index.js +23 -0
- package/src/agent/pipelines/initializer.js +163 -0
- package/src/agent/pipelines/production-qc.js +99 -0
- package/src/agent/pipelines/skill-authoring.js +83 -0
- package/src/agent/pipelines/skill-testing.js +111 -0
- package/src/agent/tools/agent-tool.js +100 -0
- package/src/agent/tools/base.js +35 -0
- package/src/agent/tools/dashboard-render.js +146 -0
- package/src/agent/tools/document-parse.js +184 -0
- package/src/agent/tools/document-search.js +111 -0
- package/src/agent/tools/evolution-cycle.js +150 -0
- package/src/agent/tools/qc-sample.js +94 -0
- package/src/agent/tools/registry.js +55 -0
- package/src/agent/tools/rule-catalog.js +113 -0
- package/src/agent/tools/sandbox-exec.js +106 -0
- package/src/agent/tools/tier-downgrade.js +114 -0
- package/src/agent/tools/worker-llm-call.js +109 -0
- package/src/agent/tools/workflow-run.js +138 -0
- package/src/agent/tools/workspace-file.js +122 -0
- package/src/agent/version-manager.js +130 -0
- package/src/agent/workspace.js +82 -0
- package/src/cli/components.js +164 -0
- package/src/cli/index.js +329 -0
- package/src/cli/init.js +80 -0
- package/src/cli/onboard.js +182 -0
- package/src/cli/terminal.js +143 -0
- package/src/config.js +93 -0
- package/template/.env.template +31 -0
- package/template/CLAUDE.md +137 -0
- package/template/Input/.gitkeep +0 -0
- package/template/Output/.gitkeep +0 -0
- package/template/Rules/.gitkeep +0 -0
- package/template/Samples/.gitkeep +0 -0
- package/template/skills/en/meta/compliance-judgment/SKILL.md +114 -0
- package/template/skills/en/meta/compliance-judgment/references/output-format.md +151 -0
- package/template/skills/en/meta/confidence-system/SKILL.md +117 -0
- package/template/skills/en/meta/corner-case-management/SKILL.md +111 -0
- package/template/skills/en/meta/cross-document-verification/SKILL.md +131 -0
- package/template/skills/en/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
- package/template/skills/en/meta/data-sensibility/SKILL.md +115 -0
- package/template/skills/en/meta/document-parsing/SKILL.md +108 -0
- package/template/skills/en/meta/document-parsing/references/parser-catalog.md +40 -0
- package/template/skills/en/meta/entity-extraction/SKILL.md +129 -0
- package/template/skills/en/meta/tree-processing/SKILL.md +103 -0
- package/template/skills/en/meta-meta/bootstrap-workspace/SKILL.md +70 -0
- package/template/skills/en/meta-meta/dashboard-reporting/SKILL.md +106 -0
- package/template/skills/en/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
- package/template/skills/en/meta-meta/evolution-loop/SKILL.md +210 -0
- package/template/skills/en/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
- package/template/skills/en/meta-meta/quality-control/SKILL.md +138 -0
- package/template/skills/en/meta-meta/quality-control/references/qa-layers.md +92 -0
- package/template/skills/en/meta-meta/quality-control/references/sampling-strategies.md +76 -0
- package/template/skills/en/meta-meta/rule-extraction/SKILL.md +100 -0
- package/template/skills/en/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
- package/template/skills/en/meta-meta/rule-graph/SKILL.md +118 -0
- package/template/skills/en/meta-meta/skill-authoring/SKILL.md +108 -0
- package/template/skills/en/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
- package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +150 -0
- package/template/skills/en/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
- package/template/skills/en/meta-meta/task-decomposition/SKILL.md +129 -0
- package/template/skills/en/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
- package/template/skills/en/meta-meta/version-control/SKILL.md +152 -0
- package/template/skills/en/meta-meta/version-control/references/trace-id-spec.md +79 -0
- package/template/skills/en/skill-creator/LICENSE.txt +202 -0
- package/template/skills/en/skill-creator/SKILL.md +479 -0
- package/template/skills/en/skill-creator/agents/analyzer.md +274 -0
- package/template/skills/en/skill-creator/agents/comparator.md +202 -0
- package/template/skills/en/skill-creator/agents/grader.md +223 -0
- package/template/skills/en/skill-creator/assets/eval_review.html +146 -0
- package/template/skills/en/skill-creator/eval-viewer/generate_review.py +471 -0
- package/template/skills/en/skill-creator/eval-viewer/viewer.html +1325 -0
- package/template/skills/en/skill-creator/references/schemas.md +430 -0
- package/template/skills/en/skill-creator/scripts/__init__.py +0 -0
- package/template/skills/en/skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/template/skills/en/skill-creator/scripts/generate_report.py +326 -0
- package/template/skills/en/skill-creator/scripts/improve_description.py +248 -0
- package/template/skills/en/skill-creator/scripts/package_skill.py +136 -0
- package/template/skills/en/skill-creator/scripts/quick_validate.py +103 -0
- package/template/skills/en/skill-creator/scripts/run_eval.py +310 -0
- package/template/skills/en/skill-creator/scripts/run_loop.py +332 -0
- package/template/skills/en/skill-creator/scripts/utils.py +47 -0
- package/template/skills/zh/meta/compliance-judgment/SKILL.md +303 -0
- package/template/skills/zh/meta/compliance-judgment/references/output-format.md +151 -0
- package/template/skills/zh/meta/confidence-system/SKILL.md +228 -0
- package/template/skills/zh/meta/corner-case-management/SKILL.md +235 -0
- package/template/skills/zh/meta/cross-document-verification/SKILL.md +241 -0
- package/template/skills/zh/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
- package/template/skills/zh/meta/data-sensibility/SKILL.md +235 -0
- package/template/skills/zh/meta/document-parsing/SKILL.md +168 -0
- package/template/skills/zh/meta/document-parsing/references/parser-catalog.md +40 -0
- package/template/skills/zh/meta/entity-extraction/SKILL.md +276 -0
- package/template/skills/zh/meta/tree-processing/SKILL.md +233 -0
- package/template/skills/zh/meta-meta/bootstrap-workspace/SKILL.md +147 -0
- package/template/skills/zh/meta-meta/dashboard-reporting/SKILL.md +281 -0
- package/template/skills/zh/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
- package/template/skills/zh/meta-meta/evolution-loop/SKILL.md +302 -0
- package/template/skills/zh/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
- package/template/skills/zh/meta-meta/quality-control/SKILL.md +269 -0
- package/template/skills/zh/meta-meta/quality-control/references/qa-layers.md +92 -0
- package/template/skills/zh/meta-meta/quality-control/references/sampling-strategies.md +76 -0
- package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +208 -0
- package/template/skills/zh/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
- package/template/skills/zh/meta-meta/rule-graph/SKILL.md +203 -0
- package/template/skills/zh/meta-meta/skill-authoring/SKILL.md +235 -0
- package/template/skills/zh/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
- package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +275 -0
- package/template/skills/zh/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
- package/template/skills/zh/meta-meta/task-decomposition/SKILL.md +224 -0
- package/template/skills/zh/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
- package/template/skills/zh/meta-meta/version-control/SKILL.md +284 -0
- package/template/skills/zh/meta-meta/version-control/references/trace-id-spec.md +79 -0
- package/template/skills/zh/skill-creator/LICENSE.txt +202 -0
- package/template/skills/zh/skill-creator/SKILL.md +479 -0
- package/template/skills/zh/skill-creator/agents/analyzer.md +274 -0
- package/template/skills/zh/skill-creator/agents/comparator.md +202 -0
- package/template/skills/zh/skill-creator/agents/grader.md +223 -0
- package/template/skills/zh/skill-creator/assets/eval_review.html +146 -0
- package/template/skills/zh/skill-creator/eval-viewer/generate_review.py +471 -0
- package/template/skills/zh/skill-creator/eval-viewer/viewer.html +1325 -0
- package/template/skills/zh/skill-creator/references/schemas.md +430 -0
- package/template/skills/zh/skill-creator/scripts/__init__.py +0 -0
- package/template/skills/zh/skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/template/skills/zh/skill-creator/scripts/generate_report.py +326 -0
- package/template/skills/zh/skill-creator/scripts/improve_description.py +248 -0
- package/template/skills/zh/skill-creator/scripts/package_skill.py +136 -0
- package/template/skills/zh/skill-creator/scripts/quick_validate.py +103 -0
- package/template/skills/zh/skill-creator/scripts/run_eval.py +310 -0
- package/template/skills/zh/skill-creator/scripts/run_loop.py +332 -0
- package/template/skills/zh/skill-creator/scripts/utils.py +47 -0
|
@@ -0,0 +1,106 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: dashboard-reporting
|
|
3
|
+
description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants to see the system's status, or at any point where visual reporting would help communicate progress. Dashboards should be self-contained HTML files that can be opened by double-clicking. Also use when the developer user asks about results, accuracy, or system health.
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Dashboard Reporting
|
|
7
|
+
|
|
8
|
+
The dashboard is the developer user's window into the system. They should not need to read logs or parse JSON to understand what is happening. Give them a clear, visual summary that leads with what matters.
|
|
9
|
+
|
|
10
|
+
## Dashboard Types
|
|
11
|
+
|
|
12
|
+
### Results Dashboard
|
|
13
|
+
Generated after each batch of documents is processed.
|
|
14
|
+
|
|
15
|
+
Key elements:
|
|
16
|
+
- **Summary bar**: Total documents, pass rate, fail rate, missing rate, error rate.
|
|
17
|
+
- **Per-rule breakdown**: Table showing each rule's pass/fail counts, accuracy, and average confidence.
|
|
18
|
+
- **Failed cases**: List of documents that failed, with the rule, extracted value, expected value, and comment. Sortable and filterable.
|
|
19
|
+
- **Confidence distribution**: Histogram showing how many results fall in each confidence band.
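A minimal sketch of how results might be grouped into confidence bands before the histogram is drawn. The band edges and labels are assumptions (the skill does not prescribe them), and `confidence` is assumed to be a float between 0 and 1 on each result record.

```python
from collections import Counter

# Hypothetical band edges; adjust to the project's own confidence system.
BANDS = [(0.0, 0.5, "low"), (0.5, 0.8, "medium"), (0.8, 1.0, "high")]


def band_for(confidence: float) -> str:
    """Return the label of the band containing `confidence` (assumed 0..1)."""
    for lower, upper, label in BANDS:
        # The top band is closed on the right so that exactly 1.0 is included.
        if lower <= confidence < upper or (upper == 1.0 and confidence == 1.0):
            return label
    return "out_of_range"


def confidence_histogram(results: list[dict]) -> dict[str, int]:
    """Count results per band, skipping records without a numeric confidence."""
    counts = Counter(
        band_for(r["confidence"])
        for r in results
        if isinstance(r.get("confidence"), (int, float))
    )
    # Out-of-range values are counted internally but not reported here.
    return {label: counts.get(label, 0) for _, _, label in BANDS}
```

The returned mapping keeps the bands in a fixed order, which makes it straightforward to feed into whatever chart renderer the dashboard uses.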
|
|
20
|
+
|
|
21
|
+
### Progress Dashboard
|
|
22
|
+
Generated on demand to show the system's evolution.
|
|
23
|
+
|
|
24
|
+
Key elements:
|
|
25
|
+
- **Lifecycle status**: Which rules are in which phase (skill testing, workflow testing, production, stable).
|
|
26
|
+
- **Rule catalog**: Table of all rules with their current status, accuracy, and version.
|
|
27
|
+
- **Evolution timeline**: For each rule, how many iterations it took and what the accuracy was at each step.
|
|
28
|
+
- **Model tier assignments**: Which model is being used for each step of each rule.
|
|
29
|
+
|
|
30
|
+
### Quality Dashboard
|
|
31
|
+
Generated after quality control reviews.
|
|
32
|
+
|
|
33
|
+
Key elements:
|
|
34
|
+
- **Accuracy over time**: Line chart showing per-rule and overall accuracy across batches.
|
|
35
|
+
- **Sampling rate over time**: Trend showing whether the monitoring sample rate is decreasing (or not).
|
|
36
|
+
- **Flagged issues**: Open issues that need developer user attention.
|
|
37
|
+
- **Cost metrics**: LLM calls and tokens per document, per rule.
|
|
38
|
+
|
|
39
|
+
## Feedback Collection
|
|
40
|
+
|
|
41
|
+
Every dashboard must include mechanisms for users to report errors and comment directly on verification results. This is not a nice-to-have — user feedback is the most valuable data source in the system.
|
|
42
|
+
|
|
43
|
+
### Developer User Feedback
|
|
44
|
+
|
|
45
|
+
Developer users see full result detail. Their feedback interface should support:
|
|
46
|
+
- **Field-level correction**: Click on an extracted value, provide the correct value.
|
|
47
|
+
- **Result override**: Change a pass to fail (or vice versa) with a reason.
|
|
48
|
+
- **Rule re-evaluation request**: Flag a result for re-processing with a different approach.
|
|
49
|
+
- **Comment**: Free-text annotation on any result.
|
|
50
|
+
|
|
51
|
+
### End User Feedback
|
|
52
|
+
|
|
53
|
+
End users of the verification app see simplified results. Their interface should support:
|
|
54
|
+
- **Flag as wrong**: One-click to report a result they believe is incorrect.
|
|
55
|
+
- **Add comment**: Brief text explaining what they think is wrong.
|
|
56
|
+
- **Severity indicator**: How impactful is this error? (Critical / Important / Minor)
|
|
57
|
+
|
|
58
|
+
### Feedback as Ground Truth
|
|
59
|
+
|
|
60
|
+
User-reported errors are ground truth. They override the coding agent's judgment and the worker LLM's output. The feedback data flow:
|
|
61
|
+
|
|
62
|
+
1. User submits feedback via dashboard → stored as structured record.
|
|
63
|
+
2. Record schema: `{result_id, trace_id, reporter_role, feedback_type, original_result, corrected_value, comment, timestamp}`.
|
|
64
|
+
3. Feedback records are fed into the `evolution-loop` as confirmed failures.
|
|
65
|
+
4. Dashboard surfaces feedback trends: correction rate over time, most-reported issues, rules with highest user correction rates.
|
|
66
|
+
|
|
67
|
+
Build the feedback collection mechanism alongside the dashboard generation — they are not separate features. Every generated HTML dashboard should include the feedback UI, even if it initially writes to a local JSON file that the coding agent reads on the next iteration.
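As a rough sketch of the agent-side half of this flow, the record schema above could be modeled as a dataclass and appended to a local JSON file. The field types, the example role and type values, and the helper names `append_feedback` and `now_iso` are assumptions. How the browser hands records to that file (for example, by exporting JSON that the agent later picks up) is left open here.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


@dataclass
class FeedbackRecord:
    """Mirrors the record schema listed above; the field types are assumptions."""
    result_id: str
    trace_id: str
    reporter_role: str            # e.g. "developer" or "end_user" (assumed values)
    feedback_type: str            # e.g. "correction", "override", "flag" (assumed values)
    original_result: str
    corrected_value: Optional[str]
    comment: str
    timestamp: str


def now_iso() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def append_feedback(path: Path, record: FeedbackRecord) -> None:
    """Append one record to a local JSON file that holds a list of records."""
    records = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    records.append(asdict(record))
    path.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")
```

Keeping the file as a plain list of records means the coding agent can read it on the next iteration with a single `json.loads` call.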
|
|
68
|
+
|
|
69
|
+
## Technology
|
|
70
|
+
|
|
71
|
+
Self-contained HTML with embedded CSS and JavaScript. Requirements:
|
|
72
|
+
- **No external dependencies.** No CDN links and no npm packages. Everything is inlined.
|
|
73
|
+
- **No server required.** The developer user double-clicks the HTML file to open it in their browser.
|
|
74
|
+
- **Responsive layout.** Should work on both desktop and mobile screens.
|
|
75
|
+
- **Dark/light mode.** Respect the system preference or provide a toggle.
|
|
76
|
+
|
|
77
|
+
For charts, use inline SVG or a lightweight chart library that can be embedded (e.g., Chart.js or lightweight alternatives, inlined as a script tag).
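For the inline SVG option, a bar chart can be produced with plain string formatting and no chart library at all. The sketch below is one possible shape, not part of the starter script; the function name `svg_bar_chart` and the layout constants are assumptions.

```python
from html import escape


def svg_bar_chart(counts: dict[str, int], width: int = 320, height: int = 160) -> str:
    """Render label -> count pairs as a standalone <svg> string with no external assets."""
    if not counts:
        return f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}"></svg>'
    peak = max(counts.values()) or 1      # avoid division by zero when every count is 0
    slot = width / len(counts)            # horizontal space reserved per bar
    label_space = 20                      # pixels kept free under the bars for labels
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for i, (label, value) in enumerate(counts.items()):
        bar_height = (height - label_space) * value / peak
        x = i * slot + slot * 0.1
        y = height - label_space - bar_height
        parts.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" width="{slot * 0.8:.1f}" '
            f'height="{bar_height:.1f}" fill="currentColor"/>'
        )
        # Labels are escaped because they may come from untrusted result data.
        parts.append(
            f'<text x="{i * slot + slot / 2:.1f}" y="{height - 5}" '
            f'text-anchor="middle" font-size="11" fill="currentColor">{escape(label)}</text>'
        )
    parts.append("</svg>")
    return "".join(parts)
```

Using `currentColor` lets the chart follow the page's text color, so it adapts to dark and light mode without extra styling.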
|
|
78
|
+
|
|
79
|
+
## Data Sources
|
|
80
|
+
|
|
81
|
+
Dashboards read from:
|
|
82
|
+
- `Output/` for verification results.
|
|
83
|
+
- `logs/` for evolution and testing history.
|
|
84
|
+
- `versions.json` for current system state.
|
|
85
|
+
- QC review records (stored alongside Output/).
|
|
86
|
+
|
|
87
|
+
The dashboard generation script should accept input paths and produce a single HTML file.
|
|
88
|
+
|
|
89
|
+
## Generation Triggers
|
|
90
|
+
|
|
91
|
+
Generate dashboards automatically after:
|
|
92
|
+
- Each testing round completes (skill testing or workflow testing).
|
|
93
|
+
- Each production batch finishes processing.
|
|
94
|
+
- Each quality control review cycle completes.
|
|
95
|
+
- The developer user explicitly requests one.
|
|
96
|
+
|
|
97
|
+
Store generated dashboards in `Output/dashboards/` with timestamps in filenames for history.
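One way to build such a filename is sketched below. The `<kind>_<YYYYMMDD-HHMMSS>.html` pattern and the helper name `dashboard_path` are assumptions; only the `Output/dashboards/` location comes from the text above.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def dashboard_path(kind: str, base_dir: str = "Output/dashboards",
                   now: Optional[datetime] = None) -> Path:
    """Build a path such as Output/dashboards/results_20250101-093000.html."""
    # A sortable timestamp keeps the directory listing in chronological order.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return Path(base_dir) / f"{kind}_{stamp}.html"
```

The resulting path could be passed as the `--output` argument of the starter script.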
|
|
98
|
+
|
|
99
|
+
## Design Principles
|
|
100
|
+
|
|
101
|
+
- **Lead with the summary.** The developer user should understand the system's health in 3 seconds.
|
|
102
|
+
- **Drill down on demand.** Summary → rule-level → document-level. Do not overwhelm with details upfront.
|
|
103
|
+
- **Color coding.** Green for pass/healthy, red for fail/critical, yellow for warning/attention. Simple and universal.
|
|
104
|
+
- **Actionable.** Every flagged issue should suggest what to do next.
|
|
105
|
+
|
|
106
|
+
A starter script is available in `scripts/generate_dashboard.py`. Adapt it to the specific business scenario.
|
|
@@ -0,0 +1,178 @@
|
|
|
1
|
+
"""
|
|
2
|
+
Dashboard Generator — Starter Script
|
|
3
|
+
|
|
4
|
+
Generates a self-contained HTML dashboard from verification results.
|
|
5
|
+
This is a starting point. The coding agent should customize it for the
|
|
6
|
+
specific business scenario.
|
|
7
|
+
|
|
8
|
+
Usage:
|
|
9
|
+
python generate_dashboard.py --input <output_dir> --output <dashboard.html>
|
|
10
|
+
|
|
11
|
+
Input: A directory containing verification result JSON files.
|
|
12
|
+
Output: A single self-contained HTML file.
|
|
13
|
+
"""
|
|
14
|
+
|
|
15
|
+
import argparse
|
|
16
|
+
import json
|
|
17
|
+
import os
|
|
18
|
+
from datetime import datetime
|
|
19
|
+
from pathlib import Path
|
|
20
|
+
|
|
21
|
+
|
|
22
|
+
def load_results(input_dir: str) -> list[dict]:
|
|
23
|
+
"""Load all JSON result files from the input directory."""
|
|
24
|
+
results = []
|
|
25
|
+
for f in Path(input_dir).glob("**/*.json"):
|
|
26
|
+
if "dashboard" in str(f) or "versions" in str(f):
|
|
27
|
+
continue
|
|
28
|
+
try:
|
|
29
|
+
            with open(f, encoding="utf-8") as fh:
|
|
30
|
+
data = json.load(fh)
|
|
31
|
+
if isinstance(data, list):
|
|
32
|
+
results.extend(data)
|
|
33
|
+
elif isinstance(data, dict) and "results" in data:
|
|
34
|
+
results.extend(data["results"])
|
|
35
|
+
elif isinstance(data, dict) and "rule_id" in data:
|
|
36
|
+
results.append(data)
|
|
37
|
+
        except (json.JSONDecodeError, UnicodeDecodeError, OSError):  # skip unreadable or malformed files
|
|
38
|
+
continue
|
|
39
|
+
return results
|
|
40
|
+
|
|
41
|
+
|
|
42
|
+
def compute_summary(results: list[dict]) -> dict:
|
|
43
|
+
"""Compute summary statistics from results."""
|
|
44
|
+
total = len(results)
|
|
45
|
+
if total == 0:
|
|
46
|
+
return {"total": 0, "pass": 0, "fail": 0, "missing": 0, "error": 0}
|
|
47
|
+
|
|
48
|
+
summary = {
|
|
49
|
+
"total": total,
|
|
50
|
+
"pass": sum(1 for r in results if r.get("result") == "pass"),
|
|
51
|
+
"fail": sum(1 for r in results if r.get("result") == "fail"),
|
|
52
|
+
"missing": sum(1 for r in results if r.get("result") == "missing"),
|
|
53
|
+
"error": sum(1 for r in results if r.get("result") == "error"),
|
|
54
|
+
}
|
|
55
|
+
summary["pass_rate"] = round(summary["pass"] / total * 100, 1)
|
|
56
|
+
return summary
|
|
57
|
+
|
|
58
|
+
|
|
59
|
+
def compute_per_rule(results: list[dict]) -> dict:
|
|
60
|
+
"""Compute per-rule statistics."""
|
|
61
|
+
rules = {}
|
|
62
|
+
for r in results:
|
|
63
|
+
rule_id = r.get("rule_id", "unknown")
|
|
64
|
+
if rule_id not in rules:
|
|
65
|
+
rules[rule_id] = {"total": 0, "pass": 0, "fail": 0, "results": []}
|
|
66
|
+
rules[rule_id]["total"] += 1
|
|
67
|
+
if r.get("result") == "pass":
|
|
68
|
+
rules[rule_id]["pass"] += 1
|
|
69
|
+
elif r.get("result") == "fail":
|
|
70
|
+
rules[rule_id]["fail"] += 1
|
|
71
|
+
rules[rule_id]["results"].append(r)
|
|
72
|
+
|
|
73
|
+
for rule_id, data in rules.items():
|
|
74
|
+
data["accuracy"] = round(data["pass"] / data["total"] * 100, 1) if data["total"] > 0 else 0
|
|
75
|
+
|
|
76
|
+
return rules
|
|
77
|
+
|
|
78
|
+
|
|
79
|
+
def generate_html(summary: dict, per_rule: dict, failed_cases: list[dict]) -> str:
|
|
80
|
+
"""Generate a self-contained HTML dashboard."""
|
|
81
|
+
|
|
82
|
+
rule_rows = ""
|
|
83
|
+
for rule_id, data in sorted(per_rule.items()):
|
|
84
|
+
color = "#4caf50" if data["accuracy"] >= 90 else "#ff9800" if data["accuracy"] >= 70 else "#f44336"
|
|
85
|
+
rule_rows += f"""
|
|
86
|
+
<tr>
|
|
87
|
+
<td>{rule_id}</td>
|
|
88
|
+
<td>{data['total']}</td>
|
|
89
|
+
<td>{data['pass']}</td>
|
|
90
|
+
<td>{data['fail']}</td>
|
|
91
|
+
<td style="color: {color}; font-weight: bold;">{data['accuracy']}%</td>
|
|
92
|
+
</tr>"""
|
|
93
|
+
|
|
94
|
+
fail_rows = ""
|
|
95
|
+
for case in failed_cases[:100]: # Limit to first 100
|
|
96
|
+
fail_rows += f"""
|
|
97
|
+
<tr>
|
|
98
|
+
<td>{case.get('document', 'N/A')}</td>
|
|
99
|
+
<td>{case.get('rule_id', 'N/A')}</td>
|
|
100
|
+
<td>{case.get('extracted_value', 'N/A')}</td>
|
|
101
|
+
<td>{case.get('comment', 'N/A')}</td>
|
|
102
|
+
<td>{case.get('confidence', 'N/A')}</td>
|
|
103
|
+
</tr>"""
|
|
104
|
+
|
|
105
|
+
html = f"""<!DOCTYPE html>
|
|
106
|
+
<html lang="en">
|
|
107
|
+
<head>
|
|
108
|
+
<meta charset="UTF-8">
|
|
109
|
+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
|
110
|
+
<title>KC Reborn — Verification Dashboard</title>
|
|
111
|
+
<style>
|
|
112
|
+
:root {{ --bg: #1a1a2e; --surface: #16213e; --text: #e0e0e0; --accent: #4caf50; --warn: #ff9800; --err: #f44336; }}
|
|
113
|
+
@media (prefers-color-scheme: light) {{
|
|
114
|
+
:root {{ --bg: #f5f5f5; --surface: #ffffff; --text: #333333; }}
|
|
115
|
+
}}
|
|
116
|
+
* {{ margin: 0; padding: 0; box-sizing: border-box; }}
|
|
117
|
+
body {{ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; background: var(--bg); color: var(--text); padding: 20px; }}
|
|
118
|
+
h1 {{ margin-bottom: 20px; }}
|
|
119
|
+
.summary {{ display: flex; gap: 16px; margin-bottom: 24px; flex-wrap: wrap; }}
|
|
120
|
+
.card {{ background: var(--surface); padding: 20px; border-radius: 8px; min-width: 140px; text-align: center; }}
|
|
121
|
+
.card .number {{ font-size: 2em; font-weight: bold; }}
|
|
122
|
+
.card .label {{ font-size: 0.85em; opacity: 0.7; margin-top: 4px; }}
|
|
123
|
+
table {{ width: 100%; border-collapse: collapse; background: var(--surface); border-radius: 8px; overflow: hidden; margin-bottom: 24px; }}
|
|
124
|
+
th, td {{ padding: 10px 14px; text-align: left; border-bottom: 1px solid rgba(255,255,255,0.1); }}
|
|
125
|
+
th {{ background: rgba(0,0,0,0.2); font-weight: 600; }}
|
|
126
|
+
h2 {{ margin: 20px 0 12px; }}
|
|
127
|
+
.timestamp {{ opacity: 0.5; font-size: 0.85em; margin-bottom: 20px; }}
|
|
128
|
+
</style>
|
|
129
|
+
</head>
|
|
130
|
+
<body>
|
|
131
|
+
<h1>Verification Dashboard</h1>
|
|
132
|
+
<div class="timestamp">Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</div>
|
|
133
|
+
|
|
134
|
+
<div class="summary">
|
|
135
|
+
<div class="card"><div class="number">{summary['total']}</div><div class="label">Total</div></div>
|
|
136
|
+
<div class="card"><div class="number" style="color: var(--accent);">{summary['pass']}</div><div class="label">Pass</div></div>
|
|
137
|
+
<div class="card"><div class="number" style="color: var(--err);">{summary['fail']}</div><div class="label">Fail</div></div>
|
|
138
|
+
<div class="card"><div class="number" style="color: var(--warn);">{summary.get('missing', 0)}</div><div class="label">Missing</div></div>
|
|
139
|
+
<div class="card"><div class="number">{summary.get('pass_rate', 0)}%</div><div class="label">Pass Rate</div></div>
|
|
140
|
+
</div>
|
|
141
|
+
|
|
142
|
+
<h2>Per-Rule Breakdown</h2>
|
|
143
|
+
<table>
|
|
144
|
+
<thead><tr><th>Rule</th><th>Total</th><th>Pass</th><th>Fail</th><th>Accuracy</th></tr></thead>
|
|
145
|
+
<tbody>{rule_rows}</tbody>
|
|
146
|
+
</table>
|
|
147
|
+
|
|
148
|
+
<h2>Failed Cases</h2>
|
|
149
|
+
<table>
|
|
150
|
+
<thead><tr><th>Document</th><th>Rule</th><th>Extracted Value</th><th>Comment</th><th>Confidence</th></tr></thead>
|
|
151
|
+
<tbody>{fail_rows if fail_rows else '<tr><td colspan="5" style="text-align:center; opacity:0.5;">No failures</td></tr>'}</tbody>
|
|
152
|
+
</table>
|
|
153
|
+
</body>
|
|
154
|
+
</html>"""
|
|
155
|
+
return html
|
|
156
|
+
|
|
157
|
+
|
|
158
|
+
def main():
|
|
159
|
+
parser = argparse.ArgumentParser(description="Generate verification dashboard")
|
|
160
|
+
parser.add_argument("--input", required=True, help="Directory containing result JSON files")
|
|
161
|
+
parser.add_argument("--output", default="dashboard.html", help="Output HTML file path")
|
|
162
|
+
args = parser.parse_args()
|
|
163
|
+
|
|
164
|
+
results = load_results(args.input)
|
|
165
|
+
summary = compute_summary(results)
|
|
166
|
+
per_rule = compute_per_rule(results)
|
|
167
|
+
failed_cases = [r for r in results if r.get("result") == "fail"]
|
|
168
|
+
|
|
169
|
+
html = generate_html(summary, per_rule, failed_cases)
|
|
170
|
+
|
|
171
|
+
output_path = Path(args.output)
|
|
172
|
+
output_path.parent.mkdir(parents=True, exist_ok=True)
|
|
173
|
+
output_path.write_text(html, encoding="utf-8")
|
|
174
|
+
print(f"Dashboard generated: {output_path}")
|
|
175
|
+
|
|
176
|
+
|
|
177
|
+
if __name__ == "__main__":
|
|
178
|
+
main()
|
|
@@ -0,0 +1,210 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: evolution-loop
|
|
3
|
+
description: Drive continuous improvement of skills and workflows through the diagnose-classify-fix-retest cycle. Use after any testing round reveals failures, when production quality control flags issues, or when accuracy drops below thresholds. Covers failure analysis, distinguishing systemic issues from corner cases, deciding whether to rewrite or patch, and knowing when to stop iterating. The evolution loop is the heartbeat of the system. Also use when transitioning between lifecycle phases (skill testing, workflow testing, production monitoring).
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Evolution Loop
|
|
7
|
+
|
|
8
|
+
The evolution loop is what makes this system self-improving rather than static. Every failure is information. The question is always: what kind of failure is this, and what is the most efficient fix?
|
|
9
|
+
|
|
10
|
+
## The Loop
|
|
11
|
+
|
|
12
|
+
```
|
|
13
|
+
Test → Observe → Reflect → Diagnose → Classify → Fix → Re-test → Log
|
|
14
|
+
↑                                                              |
|
|
15
|
+
└──────────────────────────────────────────────────────────────┘
|
|
16
|
+
```
|
|
17
|
+
|
|
18
|
+
Run this loop until one of:
|
|
19
|
+
- Accuracy meets the threshold in `.env` (SKILL_ACCURACY or WORKFLOW_ACCURACY, depending on the phase).
|
|
20
|
+
- MAX_ITERATIONS is reached (escalate to developer user).
|
|
21
|
+
- You determine that the remaining failures are irreducible given the current approach (escalate to developer user).
|
|
22
|
+
- Convergence criteria are met (see Convergence Tracking below).
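The stopping conditions above can be sketched as a single check. This is an illustration only: the function name, the returned labels, and the assumption that `accuracy` and `threshold` are expressed in the same unit (both fractions or both percentages) are not taken from the package. The `converged` and `irreducible` flags stand in for judgments made elsewhere.

```python
from typing import Optional


def stop_reason(accuracy: float, iteration: int, threshold: float,
                max_iterations: int, converged: bool = False,
                irreducible: bool = False) -> Optional[str]:
    """Return why the loop should stop, or None to keep iterating."""
    if accuracy >= threshold:
        return "threshold_met"
    if iteration >= max_iterations:
        return "max_iterations"       # escalate to the developer user
    if irreducible:
        return "irreducible"          # escalate to the developer user
    if converged:
        return "converged"
    return None
```

In practice `threshold` would be read from `SKILL_ACCURACY` or `WORKFLOW_ACCURACY` and `max_iterations` from `MAX_ITERATIONS`, depending on the phase.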
|
|
23
|
+
|
|
24
|
+
## Step 1: Test
|
|
25
|
+
|
|
26
|
+
Run the skill or workflow on the relevant document set:
|
|
27
|
+
- During skill development: run on Samples/.
|
|
28
|
+
- During workflow testing: run on Samples/, compare to skill-based ground truth.
|
|
29
|
+
- During production: run on Input/ batches.
|
|
30
|
+
|
|
31
|
+
Record every result: pass/fail/missing/error, the extracted values, the confidence scores, and any comments.
|
|
32
|
+
|
|
33
|
+
## Step 2: Observe
|
|
34
|
+
|
|
35
|
+
Review the results holistically:
|
|
36
|
+
- What is the overall accuracy?
|
|
37
|
+
- Which rules are failing most?
|
|
38
|
+
- Are there patterns in the failures? Same section type? Same document format? Same kind of entity?
|
|
39
|
+
- Are there documents that fail multiple rules?
|
|
40
|
+
|
|
41
|
+
Do not just count. Understand.
|
|
42
|
+
|
|
43
|
+
### User-Reported Errors
|
|
44
|
+
|
|
45
|
+
When developer users or end users report errors on verification results, treat these corrections as ground truth. User-reported corrections override the coding agent's own quality judgments. In the evolution loop, prioritize diagnosing user-reported errors before agent-detected ones — they represent confirmed failures, not suspected ones.
|
|
46
|
+
|
|
47
|
+
Collect user error reports from the feedback mechanisms built into the dashboard (see `dashboard-reporting`). Each report should be converted into a test case and added to the regression set.
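A small sketch of that conversion, assuming feedback records follow the schema described in `dashboard-reporting`. The shape of the resulting test case (`case_id`, `expected`, `source`, and so on) is hypothetical.

```python
def feedback_to_test_case(record: dict) -> dict:
    """Turn one feedback record into a regression test case (shape is hypothetical)."""
    return {
        "case_id": f"fb-{record['result_id']}",
        "trace_id": record.get("trace_id"),
        # The user's correction is treated as the expected value (ground truth).
        "expected": record.get("corrected_value"),
        "previously_reported": record.get("original_result"),
        "source": f"user_feedback:{record.get('reporter_role', 'unknown')}",
    }
```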
|
|
48
|
+
|
|
49
|
+
## Step 3: Reflect
|
|
50
|
+
|
|
51
|
+
Before diagnosing new failures, review what has already been tried. This prevents cycling through the same failed fixes.
|
|
52
|
+
|
|
53
|
+
Read the structured iteration logs from `logs/evolution/{rule_id}/`. Focus on:
|
|
54
|
+
- Which failure categories were identified in previous iterations.
|
|
55
|
+
- What fixes were attempted and their outcomes — did accuracy go up, down, or stay flat?
|
|
56
|
+
- Whether any fix was reverted due to regression.
|
|
57
|
+
|
|
58
|
+
**Anti-pattern detection**: If the current failures match a pattern that was already diagnosed and "fixed" in a prior iteration, the prior fix was insufficient. Escalate the approach — for example, from a prompt tweak to a logic rewrite, or from a logic rewrite to an architecture change. Do not try the same category of fix twice.
|
|
59
|
+
|
|
60
|
+
**Output**: Produce a brief iteration history summary to feed into the Diagnose step. This gives the diagnosis context about what NOT to try again. On the first iteration (no history), skip this step.
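The anti-pattern check could look roughly like this. The `failure_category` and `fix_type` keys are hypothetical and may not match the actual iteration log format; the idea is only to list, for each category that is failing again, which kinds of fix have already been tried.

```python
def repeated_fix_categories(history: list[dict],
                            current_failures: set[str]) -> dict[str, list[str]]:
    """Map each currently failing category to the fix types already tried for it.

    `history` is a list of prior iteration logs. The `failure_category` and
    `fix_type` keys are hypothetical, not the documented log schema.
    """
    tried: dict[str, list[str]] = {}
    for entry in history:
        category = entry.get("failure_category")
        fix_type = entry.get("fix_type")
        if category in current_failures and fix_type:
            tried.setdefault(category, [])
            if fix_type not in tried[category]:
                tried[category].append(fix_type)
    return tried
```

Any category that appears in the result has recurred after a fix, so the next attempt should use a different kind of fix than the ones listed.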
|
|
61
|
+
|
|
62
|
+
### Three Dimensions of Reflection
|
|
63
|
+
|
|
64
|
+
When reviewing failures, analyze them along three cross-cutting dimensions:
|
|
65
|
+
|
|
66
|
+
1. **Per-rule across documents**: Is a specific rule (e.g., R003) failing on a particular subset of documents? This reveals rule-specific weaknesses — perhaps the extraction prompt does not handle a particular document format.
|
|
67
|
+
|
|
68
|
+
2. **Per-document across rules**: Is a specific document type causing failures across multiple rules? This reveals document-level issues — perhaps a particular template has parsing problems that affect all downstream extraction.
|
|
69
|
+
|
|
70
|
+
3. **Global patterns**: Are there systemic correlations? For example:
|
|
71
|
+
- Failures clustering on documents processed by a specific model tier.
|
|
72
|
+
- Failures clustering on documents that went through a specific parser level.
|
|
73
|
+
- Failures that only appear when document length exceeds a threshold.
|
|
74
|
+
|
|
75
|
+
Cross-referencing these dimensions often reveals root causes invisible from a single perspective. For example, "R003 fails on type-X documents" is a per-rule finding, but if type-X documents also cause R007 and R012 to fail, the real issue is the document type, not any individual rule.
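A sketch of that cross-referencing, assuming each failure record carries a `rule_id` (as in the starter script's result records) and a `doc_type` field, which is hypothetical. The `min_rules` cutoff is also an assumption.

```python
from collections import defaultdict


def cross_reference(failures: list[dict], min_rules: int = 2) -> dict:
    """Group failures per rule and per document type, then flag types that break several rules."""
    by_rule: dict[str, set[str]] = defaultdict(set)
    by_doc_type: dict[str, set[str]] = defaultdict(set)
    for f in failures:
        rule = f.get("rule_id", "unknown")
        doc_type = f.get("doc_type", "unknown")   # hypothetical field
        by_rule[rule].add(doc_type)
        by_doc_type[doc_type].add(rule)
    # A document type that fails across several rules points at the document, not the rules.
    suspect_types = sorted(t for t, rules in by_doc_type.items() if len(rules) >= min_rules)
    return {
        "by_rule": dict(by_rule),
        "by_doc_type": dict(by_doc_type),
        "suspect_doc_types": suspect_types,
    }
```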
|
|
76
|
+
|
|
77
|
+
## Step 4: Diagnose
|
|
78
|
+
|
|
79
|
+
For each failure, determine the root cause:
|
|
80
|
+
- **Parsing failure**: the document was not parsed correctly. The text was garbled, tables were mangled, or content was missing.
|
|
81
|
+
- **Extraction failure**: the entity was not found, or the wrong value was extracted. The section was located correctly but the entity extraction failed.
|
|
82
|
+
- **Judgment failure**: the entity was extracted correctly but the pass/fail determination was wrong. The logic is flawed or the edge case is not handled.
|
|
83
|
+
- **Scope failure**: the rule was applied to a section where it does not apply, or not applied to a section where it should.
|
|
84
|
+
|
|
85
|
+
## Step 5: Classify
|
|
86
|
+
|
|
87
|
+
This is the critical decision:
|
|
88
|
+
|
|
89
|
+
### Systemic Issue (affects >10% of documents)
|
|
90
|
+
The failure has a common cause that affects many documents. Examples:
|
|
91
|
+
- The document parser consistently fails on a certain table format.
|
|
92
|
+
- The extraction prompt misunderstands a common phrasing.
|
|
93
|
+
- The threshold logic has a bug.
|
|
94
|
+
|
|
95
|
+
**Action**: Fix the root cause. Rewrite the relevant part of the skill or workflow. This is a code/prompt change.
|
|
96
|
+
|
|
97
|
+
### Corner Case (affects <10% of documents)
|
|
98
|
+
The failure is specific to a few unusual documents. Examples:
|
|
99
|
+
- One document uses a non-standard date format.
|
|
100
|
+
- One document has the relevant information in a footnote instead of the main text.
|
|
101
|
+
- One document is a special report type with different structure.
|
|
102
|
+
|
|
103
|
+
**Action**: Do NOT patch the main workflow. Record the pattern and resolution in `corner_cases.json`. See `corner-case-management` skill.
|
|
104
|
+
|
|
105
|
+
### The 10% Threshold
|
|
106
|
+
|
|
107
|
+
This is a heuristic, not a law. Use judgment:
|
|
108
|
+
- If 8% of documents fail the same way but the fix is simple, treat it as systemic.
|
|
109
|
+
- If 12% of documents fail but each for a different reason, treat them as individual corner cases.
|
|
110
|
+
- When in doubt, discuss with the developer user.
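As a rough illustration of the heuristic, the sketch below groups failures by root cause and labels each group by the share of documents it affects. The function name, the input shape (a list of `(doc_id, root_cause)` pairs), and the labels are assumptions made for this example.

```python
SYSTEMIC_SHARE = 0.10  # the 10% heuristic from the text, not a hard rule

def classify_failures(failures, total_documents, threshold=SYSTEMIC_SHARE):
    """failures: list of (doc_id, root_cause) pairs.

    Returns {root_cause: (share_of_documents, label)}.
    """
    docs_by_cause = {}
    for doc_id, cause in failures:
        docs_by_cause.setdefault(cause, set()).add(doc_id)
    result = {}
    for cause, docs in docs_by_cause.items():
        share = len(docs) / total_documents
        label = "systemic" if share > threshold else "corner_case"
        result[cause] = (share, label)
    return result
```

Because the share is computed per root cause, 12 failing documents out of 100 with 12 different causes come out as 12 separate corner cases, matching the second bullet above. The "8% but simple fix" case is a judgment call that this sketch deliberately does not automate.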
|
|
111
|
+
|
|
112
|
+
## Step 6: Fix
|
|
113
|
+
|
|
114
|
+
For systemic issues:
|
|
115
|
+
1. Identify the specific component that needs to change (parser config, extraction prompt, judgment logic, regex pattern).
|
|
116
|
+
2. Make the change.
|
|
117
|
+
3. Version the change (see `version-control` skill).
|
|
118
|
+
|
|
119
|
+
For corner cases:
|
|
120
|
+
1. Document the pattern in `corner_cases.json`.
|
|
121
|
+
2. Add a detection mechanism (regex, keyword, structural pattern).
|
|
122
|
+
3. Define the resolution (alternative extraction method, special judgment logic).
|
|
123
|
+
4. Set a confidence threshold for matching.
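The four corner-case steps could be captured in one record plus a matcher, as sketched below. The entry layout is hypothetical (the actual `corner_cases.json` schema is not shown in this section), and `confidence` is assumed to be a score supplied by the caller.

```python
import re

# Hypothetical corner-case entry; field names are illustrative only.
corner_case = {
    "id": "cc_001",
    "description": "Date written as YYYY/MM/DD instead of the usual format",
    "detection": {"type": "regex", "pattern": r"\b\d{4}/\d{2}/\d{2}\b"},
    "resolution": "parse the date with the alternative format before judging",
    "min_confidence": 0.8,
}

def matches_corner_case(text, case, confidence):
    """True when the detection pattern fires and confidence meets the threshold."""
    if confidence < case["min_confidence"]:
        return False
    return re.search(case["detection"]["pattern"], text) is not None
```

Keeping detection and resolution in data rather than in the main workflow is what lets the workflow itself stay unpatched.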
|
|
124
|
+
|
|
125
|
+
## Step 7: Re-test
|
|
126
|
+
|
|
127
|
+
After fixing:
|
|
128
|
+
1. Re-run on the full document set, not just the failing documents. Fixes can introduce regressions.
|
|
129
|
+
2. Compare to previous iteration results.
|
|
130
|
+
3. If accuracy improved, continue. If accuracy regressed, roll back (see `version-control`).
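A minimal sketch of the comparison step, assuming each run is stored as a `{case_id: passed}` mapping (a hypothetical shape). The `"review"` outcome for unchanged accuracy is an addition of this example; the text only covers the improved and regressed cases.

```python
def compare_iterations(previous, current):
    """Compare two full-set runs over the cases present in both."""
    shared = sorted(previous.keys() & current.keys())
    if not shared:
        raise ValueError("no shared test cases to compare")
    regressions = [c for c in shared if previous[c] and not current[c]]
    fixed = [c for c in shared if not previous[c] and current[c]]
    accuracy_before = sum(previous[c] for c in shared) / len(shared)
    accuracy_after = sum(current[c] for c in shared) / len(shared)
    if accuracy_after > accuracy_before:
        decision = "continue"
    elif accuracy_after < accuracy_before:
        decision = "roll_back"
    else:
        decision = "review"  # assumption: the text does not cover a tie
    return {
        "regressions": regressions,
        "fixed": fixed,
        "accuracy_before": accuracy_before,
        "accuracy_after": accuracy_after,
        "decision": decision,
    }
```

Listing regressions separately matters because accuracy can rise overall while individual previously passing cases break.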
|
|
131
|
+
|
|
132
|
+
## Step 8: Log
|
|
133
|
+
|
|
134
|
+
For every iteration, record the following in a structured format that the Reflect step can consume:
|
|
135
|
+
|
|
136
|
+
```json
|
|
137
|
+
{
|
|
138
|
+
"iteration": 3,
|
|
139
|
+
"trigger": "regression_detected",
|
|
140
|
+
"observation_summary": "3 documents failing on date extraction",
|
|
141
|
+
"reflection": "Iteration 2 attempted regex fix for same pattern, accuracy unchanged",
|
|
142
|
+
"failures": [
|
|
143
|
+
{"case_id": "doc_007", "diagnosis": "extraction", "root_cause": "non-standard date format", "fix": "added regex pattern", "outcome": "resolved"}
|
|
144
|
+
],
|
|
145
|
+
"mutations": [
|
|
146
|
+
{"component": "scripts/check.py", "change_type": "patch", "description": "Added YYYY/MM/DD pattern"}
|
|
147
|
+
],
|
|
148
|
+
"outcome": {"accuracy_before": 0.82, "accuracy_after": 0.91, "regressions": 0}
|
|
149
|
+
}
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
This schema is a recommended starting point — adapt the fields to your specific needs. The important thing is that logs are machine-parseable (JSON) so the Reflect step can programmatically scan history.
|
|
153
|
+
|
|
154
|
+
Also maintain a plain text summary alongside the JSON for developer user readability. Store both in `logs/evolution/{rule_id}/`.
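One way to write both files is sketched below. The directory layout follows the text; the file names (`iteration_003.json`, `iteration_003.txt`) and the summary wording are assumptions for illustration.

```python
import json
from pathlib import Path

def write_iteration_log(entry, rule_id, root="logs/evolution"):
    """Write the JSON log and a plain text summary; return both paths."""
    log_dir = Path(root) / rule_id
    log_dir.mkdir(parents=True, exist_ok=True)
    stem = f"iteration_{entry['iteration']:03d}"  # hypothetical naming scheme

    json_path = log_dir / f"{stem}.json"
    json_path.write_text(json.dumps(entry, indent=2), encoding="utf-8")

    outcome = entry["outcome"]
    summary = (
        f"Iteration {entry['iteration']} ({entry['trigger']})\n"
        f"Observed: {entry['observation_summary']}\n"
        f"Accuracy: {outcome['accuracy_before']:.2f} -> "
        f"{outcome['accuracy_after']:.2f}, "
        f"regressions: {outcome['regressions']}\n"
    )
    txt_path = log_dir / f"{stem}.txt"
    txt_path.write_text(summary, encoding="utf-8")
    return json_path, txt_path
```

Zero-padding the iteration number keeps the files in order when the directory is listed alphabetically.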
|
|
155
|
+
|
|
156
|
+
## Convergence Tracking
|
|
157
|
+
|
|
158
|
+
Track three metrics per iteration to know when to stop:
|
|
159
|
+
|
|
160
|
+
- **Correction volume**: How many test cases changed result compared to the previous iteration (as a percentage of total).
|
|
161
|
+
- **New pattern count**: How many previously unseen failure patterns were identified this iteration.
|
|
162
|
+
- **Regression count**: How many test cases that passed last iteration now fail.
|
|
163
|
+
|
|
164
|
+
### Stopping Criteria
|
|
165
|
+
|
|
166
|
+
Stop the loop when ALL three conditions hold for one iteration:
|
|
167
|
+
|
|
168
|
+
1. Correction volume < 5% of total test cases.
|
|
169
|
+
2. New pattern count = 0.
|
|
170
|
+
3. Regression count = 0.
|
|
171
|
+
|
|
172
|
+
If correction volume *increases* between consecutive iterations, this is a regression signal. Pause the loop and diagnose before continuing — the last fix may be destabilizing the system.
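The three metrics and the stopping rule can be expressed compactly. This is a sketch under assumptions: runs are `{case_id: passed}` mappings, the new pattern count is supplied by whoever reviewed the failures, and the function names are made up for this example.

```python
def convergence_metrics(previous, current, new_pattern_count):
    """Compute the three tracked metrics from two consecutive runs."""
    shared = previous.keys() & current.keys()
    if not shared:
        raise ValueError("no shared test cases to compare")
    changed = sum(previous[c] != current[c] for c in shared)
    regressions = sum(previous[c] and not current[c] for c in shared)
    return {
        "correction_volume": changed / len(shared),  # fraction of total cases
        "new_pattern_count": new_pattern_count,
        "regression_count": regressions,
    }

def should_stop(metrics, volume_threshold=0.05):
    """All three conditions must hold in the same iteration."""
    return (
        metrics["correction_volume"] < volume_threshold
        and metrics["new_pattern_count"] == 0
        and metrics["regression_count"] == 0
    )

def volume_increased(previous_volume, current_volume):
    """Regression signal: pause and diagnose when this returns True."""
    return current_volume > previous_volume
```

For example, if 4 of 100 cases flip from fail to pass, no new patterns appear, and nothing regresses, `should_stop` returns `True`. A single new pattern is enough to keep the loop running.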
|
|
173
|
+
|
|
174
|
+
### Expected Convergence
|
|
175
|
+
|
|
176
|
+
Convergence is not linear. Expect rapid improvement in early iterations followed by diminishing returns:
|
|
177
|
+
|
|
178
|
+
| Document count | Typical iterations to converge |
|
|
179
|
+
|---------------|-------------------------------|
|
|
180
|
+
| < 50 | 3–5 |
|
|
181
|
+
| 50–200 | 4–7 |
|
|
182
|
+
| 200–1000 | 5–10 |
|
|
183
|
+
| 1000+ | 7–15 |
|
|
184
|
+
|
|
185
|
+
These are empirical estimates. Actual convergence depends on rule complexity, document variability, and model capability. If convergence takes significantly more iterations than expected, revisit your approach rather than grinding through more rounds.
|
|
186
|
+
|
|
187
|
+
See `references/convergence-guide.md` for diagnostic procedures and real-world convergence data.
|
|
188
|
+
|
|
189
|
+
## The Three Phases of Life
|
|
190
|
+
|
|
191
|
+
### Phase 1: Evolution (Active Improvement)
|
|
192
|
+
The system is being built. Every testing round reveals issues. The loop runs frequently. This is where you spend most of your effort.
|
|
193
|
+
|
|
194
|
+
### Phase 2: Monitoring (Quality Control)
|
|
195
|
+
The system is deployed. Workflows are running on production data. The loop runs on sampled results (see `quality-control` skill). You intervene only when quality drops.
|
|
196
|
+
|
|
197
|
+
### Phase 3: Stability (Near-Zero Oversight)
|
|
198
|
+
The system is mature. Quality has been stable for a sustained period. The loop runs rarely, on small random samples. You intervene only when business rules change.
|
|
199
|
+
|
|
200
|
+
Transition between phases is gradual, not sudden. And it can reverse — a regulatory change can push you from Phase 3 back to Phase 1.
|
|
201
|
+
|
|
202
|
+
## Escalation
|
|
203
|
+
|
|
204
|
+
Escalate to the developer user when:
|
|
205
|
+
- MAX_ITERATIONS is reached without meeting the accuracy threshold.
|
|
206
|
+
- You cannot determine the root cause of a failure.
|
|
207
|
+
- The fix requires domain knowledge you do not have.
|
|
208
|
+
- Multiple rules interact in ways you cannot resolve.
|
|
209
|
+
|
|
210
|
+
When escalating, provide: the rule, the failing documents, the diagnosis, what you tried, and why it did not work.
|
|
@@ -0,0 +1,62 @@
|
|
|
1
|
+
# Convergence Guide
|
|
2
|
+
|
|
3
|
+
Diagnostic procedures and real-world data for understanding when the evolution loop is converging, stalling, or regressing.
|
|
4
|
+
|
|
5
|
+
## Empirical Data
|
|
6
|
+
|
|
7
|
+
### The Shiji Project — Event Dating Reflection
|
|
8
|
+
|
|
9
|
+
A document verification project for historical event dating across regulatory filings. The evolution loop ran for five rounds:
|
|
10
|
+
|
|
11
|
+
- **Round 1**: 1,010 corrections (first pass — many extraction and judgment errors across the board).
|
|
12
|
+
- **Round 2**: 431 corrections (systematic fixes applied — regex patterns, prompt refinements).
|
|
13
|
+
- **Round 3**: 465 corrections (regression — round 2 fix for date normalization introduced new failures on edge-case date formats).
|
|
14
|
+
- **Round 4**: 167 corrections (stabilizing — round 3 regression diagnosed and resolved, remaining issues are corner cases).
|
|
15
|
+
- **Round 5**: 46 corrections (converged — below 5% threshold, no new patterns, no regressions).
|
|
16
|
+
|
|
17
|
+
**Key insight**: The round 3 spike was the most informative event. It revealed that the round 2 fix was too aggressive — it normalized dates that should not have been normalized. Without convergence tracking, this regression might have been masked by overall accuracy still improving on other cases.
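The round 3 spike is easy to surface mechanically from the counts listed above. The sketch below is illustrative only; the function names are hypothetical.

```python
# Correction counts for rounds 1-5, taken from the list above.
corrections = [1010, 431, 465, 167, 46]

def round_over_round(volumes):
    """Return (round_number, ratio_to_previous_round) for rounds 2..n."""
    return [(i + 1, volumes[i] / volumes[i - 1]) for i in range(1, len(volumes))]

def regression_rounds(volumes):
    """Rounds whose correction count rose relative to the round before."""
    return [rnd for rnd, ratio in round_over_round(volumes) if ratio > 1.0]
```

On this data the ratios are roughly 0.43, 1.08, 0.36, and 0.28, so only round 3 is flagged.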
|
|
18
|
+
|
|
19
|
+
## Diagnostic Flowchart
|
|
20
|
+
|
|
21
|
+
### If correction volume increases between iterations:
|
|
22
|
+
|
|
23
|
+
1. **Check for regression**: Are previously passing cases now failing? If yes, the last fix is the likely cause. Compare the diff between iterations.
|
|
24
|
+
2. **Check for fix conflicts**: Does the new fix contradict a prior fix? For example, broadening a regex in round N that was narrowed in round N-1.
|
|
25
|
+
3. **Check for test set changes**: Did new documents enter the test set between iterations? New documents can inflate correction volume without indicating regression.
|
|
26
|
+
|
|
27
|
+
### If correction volume stays flat (not decreasing):
|
|
28
|
+
|
|
29
|
+
1. **Check for oscillation**: Are the same cases flipping between pass and fail across iterations? This indicates the fix is unstable — it solves one variant but breaks another.
|
|
30
|
+
2. **Check if fix is too narrow**: The fix addresses the specific failing cases but does not generalize. The next iteration reveals similar cases the fix missed.
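The oscillation check can be sketched by counting how often each case flips between pass and fail across iterations. The input shape (`{case_id: [passed per iteration, oldest first]}`) and the default of two flips are assumptions for this example.

```python
def count_flips(history):
    """Number of pass/fail changes between consecutive iterations."""
    return sum(a != b for a, b in zip(history, history[1:]))

def oscillating_cases(results_by_case, min_flips=2):
    """Case ids whose result changed at least min_flips times."""
    return sorted(
        case_id
        for case_id, history in results_by_case.items()
        if count_flips(history) >= min_flips
    )
```

A case that goes fail, pass, pass, pass has one flip and is simply fixed. A case that goes fail, pass, fail, pass has three flips and is a candidate for an unstable fix.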
|
|
31
|
+
|
|
32
|
+
## False Convergence
|
|
33
|
+
|
|
34
|
+
Metrics look stable but underlying issues are masked. The system appears converged but will fail on production data.
|
|
35
|
+
|
|
36
|
+
### Common Causes
|
|
37
|
+
|
|
38
|
+
- **Test set too small**: With fewer than 20 test cases, a single case changing swings metrics by more than 5 percentage points. Convergence at this scale is statistically meaningless.
|
|
39
|
+
- **Test set does not cover production variety**: The test set was curated from "clean" examples. Production documents include scanned PDFs, handwritten annotations, multi-language content, and formatting variations the test set never saw.
|
|
40
|
+
- **Corner cases excluded from metrics**: If difficult cases are moved to `corner_cases.json` and excluded from accuracy calculation, the remaining "easy" cases converge quickly but the real problem is hidden.
|
|
41
|
+
|
|
42
|
+
### Detection
|
|
43
|
+
|
|
44
|
+
Compare test set distribution to production distribution on key dimensions: document type, length, format, source. If they diverge significantly, convergence on the test set does not guarantee production quality.
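One simple way to quantify the divergence on a single categorical dimension is total variation distance. That metric is a choice made for this sketch, not something the guide prescribes, and any cutoff for "diverge significantly" is left to the reader.

```python
from collections import Counter

def distribution(values):
    """Relative frequency of each category in a non-empty list."""
    if not values:
        raise ValueError("cannot build a distribution from an empty list")
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}

def total_variation_distance(test_values, production_values):
    """0.0 means identical category mixes; 1.0 means no overlap at all."""
    p = distribution(test_values)
    q = distribution(production_values)
    categories = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in categories)
```

For instance, a test set that is 90% native PDFs and 10% scans, compared with production traffic split evenly between the two, gives a distance of 0.4. The same function can be run per dimension (document type, format, source), with length bucketed into ranges first.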
|
|
45
|
+
|
|
46
|
+
## Estimating Remaining Rounds
|
|
47
|
+
|
|
48
|
+
### Simple Heuristic
|
|
49
|
+
|
|
50
|
+
If corrections approximately halve each round, expect about `log2(current_corrections / threshold)` more rounds, rounded up to a whole number.
|
|
51
|
+
|
|
52
|
+
Example: current round has 200 corrections, threshold is 5% of 1000 cases = 50 corrections.
|
|
53
|
+
- Estimated remaining rounds: log2(200/50) = log2(4) = 2 rounds.
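The same estimate as a small helper, rounding up to whole rounds. The function name is hypothetical, and the result is only as good as the halving assumption.

```python
import math

def estimated_remaining_rounds(current_corrections, threshold_corrections):
    """Rounds needed to reach the threshold if corrections halve each round."""
    if threshold_corrections <= 0:
        raise ValueError("threshold_corrections must be positive")
    if current_corrections <= threshold_corrections:
        return 0
    return math.ceil(math.log2(current_corrections / threshold_corrections))
```

With 200 corrections and a threshold of 50 this returns 2, matching the worked example. Note that the estimate lands at the threshold rather than strictly below it, so treat it as a rough guide.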
|
|
54
|
+
|
|
55
|
+
### When the Heuristic Fails
|
|
56
|
+
|
|
57
|
+
If corrections do not halve between rounds, the current approach may have hit its ceiling. Consider:
|
|
58
|
+
- Escalating the fix strategy (prompt tweak → logic rewrite → architecture change).
|
|
59
|
+
- Expanding the test set to reveal hidden patterns.
|
|
60
|
+
- Consulting the developer user for domain insight on stubborn failures.
|
|
61
|
+
|
|
62
|
+
Do not grind through more iterations expecting different results. If three consecutive rounds show similar correction volumes, stop and reassess.
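The "three consecutive rounds" rule can be checked automatically. In this sketch, "similar" is taken to mean that the spread of the last three counts is within 10% of the largest one. That tolerance is an assumption, not a value from the guide.

```python
def is_stalled(volumes, window=3, tolerance=0.10):
    """True when the last `window` correction counts are within `tolerance`."""
    if len(volumes) < window:
        return False
    recent = volumes[-window:]
    low, high = min(recent), max(recent)
    if high == 0:
        return False  # no corrections at all is convergence, not a stall
    return (high - low) / high <= tolerance
```

A history of 400, 210, 200, 195 is flagged as stalled, while the Shiji sequence (1,010, 431, 465, 167, 46) is not.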
|