kc-beta 0.1.0

Files changed (141)
  1. package/bin/kc-beta.js +16 -0
  2. package/package.json +32 -0
  3. package/src/agent/confidence-scorer.js +120 -0
  4. package/src/agent/context.js +124 -0
  5. package/src/agent/corner-case-registry.js +119 -0
  6. package/src/agent/engine.js +224 -0
  7. package/src/agent/events.js +27 -0
  8. package/src/agent/history.js +101 -0
  9. package/src/agent/llm-client.js +131 -0
  10. package/src/agent/pipelines/base.js +14 -0
  11. package/src/agent/pipelines/distillation.js +113 -0
  12. package/src/agent/pipelines/extraction.js +92 -0
  13. package/src/agent/pipelines/index.js +23 -0
  14. package/src/agent/pipelines/initializer.js +163 -0
  15. package/src/agent/pipelines/production-qc.js +99 -0
  16. package/src/agent/pipelines/skill-authoring.js +83 -0
  17. package/src/agent/pipelines/skill-testing.js +111 -0
  18. package/src/agent/tools/agent-tool.js +100 -0
  19. package/src/agent/tools/base.js +35 -0
  20. package/src/agent/tools/dashboard-render.js +146 -0
  21. package/src/agent/tools/document-parse.js +184 -0
  22. package/src/agent/tools/document-search.js +111 -0
  23. package/src/agent/tools/evolution-cycle.js +150 -0
  24. package/src/agent/tools/qc-sample.js +94 -0
  25. package/src/agent/tools/registry.js +55 -0
  26. package/src/agent/tools/rule-catalog.js +113 -0
  27. package/src/agent/tools/sandbox-exec.js +106 -0
  28. package/src/agent/tools/tier-downgrade.js +114 -0
  29. package/src/agent/tools/worker-llm-call.js +109 -0
  30. package/src/agent/tools/workflow-run.js +138 -0
  31. package/src/agent/tools/workspace-file.js +122 -0
  32. package/src/agent/version-manager.js +130 -0
  33. package/src/agent/workspace.js +82 -0
  34. package/src/cli/components.js +164 -0
  35. package/src/cli/index.js +329 -0
  36. package/src/cli/init.js +80 -0
  37. package/src/cli/onboard.js +182 -0
  38. package/src/cli/terminal.js +143 -0
  39. package/src/config.js +93 -0
  40. package/template/.env.template +31 -0
  41. package/template/CLAUDE.md +137 -0
  42. package/template/Input/.gitkeep +0 -0
  43. package/template/Output/.gitkeep +0 -0
  44. package/template/Rules/.gitkeep +0 -0
  45. package/template/Samples/.gitkeep +0 -0
  46. package/template/skills/en/meta/compliance-judgment/SKILL.md +114 -0
  47. package/template/skills/en/meta/compliance-judgment/references/output-format.md +151 -0
  48. package/template/skills/en/meta/confidence-system/SKILL.md +117 -0
  49. package/template/skills/en/meta/corner-case-management/SKILL.md +111 -0
  50. package/template/skills/en/meta/cross-document-verification/SKILL.md +131 -0
  51. package/template/skills/en/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
  52. package/template/skills/en/meta/data-sensibility/SKILL.md +115 -0
  53. package/template/skills/en/meta/document-parsing/SKILL.md +108 -0
  54. package/template/skills/en/meta/document-parsing/references/parser-catalog.md +40 -0
  55. package/template/skills/en/meta/entity-extraction/SKILL.md +129 -0
  56. package/template/skills/en/meta/tree-processing/SKILL.md +103 -0
  57. package/template/skills/en/meta-meta/bootstrap-workspace/SKILL.md +70 -0
  58. package/template/skills/en/meta-meta/dashboard-reporting/SKILL.md +106 -0
  59. package/template/skills/en/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
  60. package/template/skills/en/meta-meta/evolution-loop/SKILL.md +210 -0
  61. package/template/skills/en/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
  62. package/template/skills/en/meta-meta/quality-control/SKILL.md +138 -0
  63. package/template/skills/en/meta-meta/quality-control/references/qa-layers.md +92 -0
  64. package/template/skills/en/meta-meta/quality-control/references/sampling-strategies.md +76 -0
  65. package/template/skills/en/meta-meta/rule-extraction/SKILL.md +100 -0
  66. package/template/skills/en/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
  67. package/template/skills/en/meta-meta/rule-graph/SKILL.md +118 -0
  68. package/template/skills/en/meta-meta/skill-authoring/SKILL.md +108 -0
  69. package/template/skills/en/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
  70. package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +150 -0
  71. package/template/skills/en/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
  72. package/template/skills/en/meta-meta/task-decomposition/SKILL.md +129 -0
  73. package/template/skills/en/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
  74. package/template/skills/en/meta-meta/version-control/SKILL.md +152 -0
  75. package/template/skills/en/meta-meta/version-control/references/trace-id-spec.md +79 -0
  76. package/template/skills/en/skill-creator/LICENSE.txt +202 -0
  77. package/template/skills/en/skill-creator/SKILL.md +479 -0
  78. package/template/skills/en/skill-creator/agents/analyzer.md +274 -0
  79. package/template/skills/en/skill-creator/agents/comparator.md +202 -0
  80. package/template/skills/en/skill-creator/agents/grader.md +223 -0
  81. package/template/skills/en/skill-creator/assets/eval_review.html +146 -0
  82. package/template/skills/en/skill-creator/eval-viewer/generate_review.py +471 -0
  83. package/template/skills/en/skill-creator/eval-viewer/viewer.html +1325 -0
  84. package/template/skills/en/skill-creator/references/schemas.md +430 -0
  85. package/template/skills/en/skill-creator/scripts/__init__.py +0 -0
  86. package/template/skills/en/skill-creator/scripts/aggregate_benchmark.py +401 -0
  87. package/template/skills/en/skill-creator/scripts/generate_report.py +326 -0
  88. package/template/skills/en/skill-creator/scripts/improve_description.py +248 -0
  89. package/template/skills/en/skill-creator/scripts/package_skill.py +136 -0
  90. package/template/skills/en/skill-creator/scripts/quick_validate.py +103 -0
  91. package/template/skills/en/skill-creator/scripts/run_eval.py +310 -0
  92. package/template/skills/en/skill-creator/scripts/run_loop.py +332 -0
  93. package/template/skills/en/skill-creator/scripts/utils.py +47 -0
  94. package/template/skills/zh/meta/compliance-judgment/SKILL.md +303 -0
  95. package/template/skills/zh/meta/compliance-judgment/references/output-format.md +151 -0
  96. package/template/skills/zh/meta/confidence-system/SKILL.md +228 -0
  97. package/template/skills/zh/meta/corner-case-management/SKILL.md +235 -0
  98. package/template/skills/zh/meta/cross-document-verification/SKILL.md +241 -0
  99. package/template/skills/zh/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
  100. package/template/skills/zh/meta/data-sensibility/SKILL.md +235 -0
  101. package/template/skills/zh/meta/document-parsing/SKILL.md +168 -0
  102. package/template/skills/zh/meta/document-parsing/references/parser-catalog.md +40 -0
  103. package/template/skills/zh/meta/entity-extraction/SKILL.md +276 -0
  104. package/template/skills/zh/meta/tree-processing/SKILL.md +233 -0
  105. package/template/skills/zh/meta-meta/bootstrap-workspace/SKILL.md +147 -0
  106. package/template/skills/zh/meta-meta/dashboard-reporting/SKILL.md +281 -0
  107. package/template/skills/zh/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
  108. package/template/skills/zh/meta-meta/evolution-loop/SKILL.md +302 -0
  109. package/template/skills/zh/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
  110. package/template/skills/zh/meta-meta/quality-control/SKILL.md +269 -0
  111. package/template/skills/zh/meta-meta/quality-control/references/qa-layers.md +92 -0
  112. package/template/skills/zh/meta-meta/quality-control/references/sampling-strategies.md +76 -0
  113. package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +208 -0
  114. package/template/skills/zh/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
  115. package/template/skills/zh/meta-meta/rule-graph/SKILL.md +203 -0
  116. package/template/skills/zh/meta-meta/skill-authoring/SKILL.md +235 -0
  117. package/template/skills/zh/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
  118. package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +275 -0
  119. package/template/skills/zh/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
  120. package/template/skills/zh/meta-meta/task-decomposition/SKILL.md +224 -0
  121. package/template/skills/zh/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
  122. package/template/skills/zh/meta-meta/version-control/SKILL.md +284 -0
  123. package/template/skills/zh/meta-meta/version-control/references/trace-id-spec.md +79 -0
  124. package/template/skills/zh/skill-creator/LICENSE.txt +202 -0
  125. package/template/skills/zh/skill-creator/SKILL.md +479 -0
  126. package/template/skills/zh/skill-creator/agents/analyzer.md +274 -0
  127. package/template/skills/zh/skill-creator/agents/comparator.md +202 -0
  128. package/template/skills/zh/skill-creator/agents/grader.md +223 -0
  129. package/template/skills/zh/skill-creator/assets/eval_review.html +146 -0
  130. package/template/skills/zh/skill-creator/eval-viewer/generate_review.py +471 -0
  131. package/template/skills/zh/skill-creator/eval-viewer/viewer.html +1325 -0
  132. package/template/skills/zh/skill-creator/references/schemas.md +430 -0
  133. package/template/skills/zh/skill-creator/scripts/__init__.py +0 -0
  134. package/template/skills/zh/skill-creator/scripts/aggregate_benchmark.py +401 -0
  135. package/template/skills/zh/skill-creator/scripts/generate_report.py +326 -0
  136. package/template/skills/zh/skill-creator/scripts/improve_description.py +248 -0
  137. package/template/skills/zh/skill-creator/scripts/package_skill.py +136 -0
  138. package/template/skills/zh/skill-creator/scripts/quick_validate.py +103 -0
  139. package/template/skills/zh/skill-creator/scripts/run_eval.py +310 -0
  140. package/template/skills/zh/skill-creator/scripts/run_loop.py +332 -0
  141. package/template/skills/zh/skill-creator/scripts/utils.py +47 -0
@@ -0,0 +1,332 @@
+ #!/usr/bin/env python3
+ """Run the eval + improve loop until all pass or max iterations reached.
+
+ Combines run_eval.py and improve_description.py in a loop, tracking history
+ and returning the best description found. Supports train/test split to prevent
+ overfitting.
+ """
+
+ import argparse
+ import json
+ import random
+ import sys
+ import tempfile
+ import time
+ import webbrowser
+ from pathlib import Path
+
+ import anthropic
+
+ from scripts.generate_report import generate_html
+ from scripts.improve_description import improve_description
+ from scripts.run_eval import find_project_root, run_eval
+ from scripts.utils import parse_skill_md
+
+
+ def split_eval_set(eval_set: list[dict], holdout: float, seed: int = 42) -> tuple[list[dict], list[dict]]:
+     """Split eval set into train and test sets, stratified by should_trigger."""
+     random.seed(seed)
+
+     # Separate by should_trigger
+     trigger = [e for e in eval_set if e["should_trigger"]]
+     no_trigger = [e for e in eval_set if not e["should_trigger"]]
+
+     # Shuffle each group
+     random.shuffle(trigger)
+     random.shuffle(no_trigger)
+
+     # Calculate split points
+     n_trigger_test = max(1, int(len(trigger) * holdout))
+     n_no_trigger_test = max(1, int(len(no_trigger) * holdout))
+
+     # Split
+     test_set = trigger[:n_trigger_test] + no_trigger[:n_no_trigger_test]
+     train_set = trigger[n_trigger_test:] + no_trigger[n_no_trigger_test:]
+
+     return train_set, test_set
+
+
+ def run_loop(
+     eval_set: list[dict],
+     skill_path: Path,
+     description_override: str | None,
+     num_workers: int,
+     timeout: int,
+     max_iterations: int,
+     runs_per_query: int,
+     trigger_threshold: float,
+     holdout: float,
+     model: str,
+     verbose: bool,
+     live_report_path: Path | None = None,
+     log_dir: Path | None = None,
+ ) -> dict:
+     """Run the eval + improvement loop."""
+     project_root = find_project_root()
+     name, original_description, content = parse_skill_md(skill_path)
+     current_description = description_override or original_description
+
+     # Split into train/test if holdout > 0
+     if holdout > 0:
+         train_set, test_set = split_eval_set(eval_set, holdout)
+         if verbose:
+             print(f"Split: {len(train_set)} train, {len(test_set)} test (holdout={holdout})", file=sys.stderr)
+     else:
+         train_set = eval_set
+         test_set = []
+
+     client = anthropic.Anthropic()
+     history = []
+     exit_reason = "unknown"
+
+     for iteration in range(1, max_iterations + 1):
+         if verbose:
+             print(f"\n{'='*60}", file=sys.stderr)
+             print(f"Iteration {iteration}/{max_iterations}", file=sys.stderr)
+             print(f"Description: {current_description}", file=sys.stderr)
+             print(f"{'='*60}", file=sys.stderr)
+
+         # Evaluate train + test together in one batch for parallelism
+         all_queries = train_set + test_set
+         t0 = time.time()
+         all_results = run_eval(
+             eval_set=all_queries,
+             skill_name=name,
+             description=current_description,
+             num_workers=num_workers,
+             timeout=timeout,
+             project_root=project_root,
+             runs_per_query=runs_per_query,
+             trigger_threshold=trigger_threshold,
+             model=model,
+         )
+         eval_elapsed = time.time() - t0
+
+         # Split results back into train/test by matching queries
+         train_queries_set = {q["query"] for q in train_set}
+         train_result_list = [r for r in all_results["results"] if r["query"] in train_queries_set]
+         test_result_list = [r for r in all_results["results"] if r["query"] not in train_queries_set]
+
+         train_passed = sum(1 for r in train_result_list if r["pass"])
+         train_total = len(train_result_list)
+         train_summary = {"passed": train_passed, "failed": train_total - train_passed, "total": train_total}
+         train_results = {"results": train_result_list, "summary": train_summary}
+
+         if test_set:
+             test_passed = sum(1 for r in test_result_list if r["pass"])
+             test_total = len(test_result_list)
+             test_summary = {"passed": test_passed, "failed": test_total - test_passed, "total": test_total}
+             test_results = {"results": test_result_list, "summary": test_summary}
+         else:
+             test_results = None
+             test_summary = None
+
+         history.append({
+             "iteration": iteration,
+             "description": current_description,
+             "train_passed": train_summary["passed"],
+             "train_failed": train_summary["failed"],
+             "train_total": train_summary["total"],
+             "train_results": train_results["results"],
+             "test_passed": test_summary["passed"] if test_summary else None,
+             "test_failed": test_summary["failed"] if test_summary else None,
+             "test_total": test_summary["total"] if test_summary else None,
+             "test_results": test_results["results"] if test_results else None,
+             # For backward compat with report generator
+             "passed": train_summary["passed"],
+             "failed": train_summary["failed"],
+             "total": train_summary["total"],
+             "results": train_results["results"],
+         })
+
+         # Write live report if path provided
+         if live_report_path:
+             partial_output = {
+                 "original_description": original_description,
+                 "best_description": current_description,
+                 "best_score": "in progress",
+                 "iterations_run": len(history),
+                 "holdout": holdout,
+                 "train_size": len(train_set),
+                 "test_size": len(test_set),
+                 "history": history,
+             }
+             live_report_path.write_text(generate_html(partial_output, auto_refresh=True, skill_name=name))
+
+         if verbose:
+             def print_eval_stats(label, results, elapsed):
+                 pos = [r for r in results if r["should_trigger"]]
+                 neg = [r for r in results if not r["should_trigger"]]
+                 tp = sum(r["triggers"] for r in pos)
+                 pos_runs = sum(r["runs"] for r in pos)
+                 fn = pos_runs - tp
+                 fp = sum(r["triggers"] for r in neg)
+                 neg_runs = sum(r["runs"] for r in neg)
+                 tn = neg_runs - fp
+                 total = tp + tn + fp + fn
+                 precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
+                 recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
+                 accuracy = (tp + tn) / total if total > 0 else 0.0
+                 print(f"{label}: {tp+tn}/{total} correct, precision={precision:.0%} recall={recall:.0%} accuracy={accuracy:.0%} ({elapsed:.1f}s)", file=sys.stderr)
+                 for r in results:
+                     status = "PASS" if r["pass"] else "FAIL"
+                     rate_str = f"{r['triggers']}/{r['runs']}"
+                     print(f" [{status}] rate={rate_str} expected={r['should_trigger']}: {r['query'][:60]}", file=sys.stderr)
+
+             print_eval_stats("Train", train_results["results"], eval_elapsed)
+             if test_summary:
+                 print_eval_stats("Test ", test_results["results"], 0)
+
+         if train_summary["failed"] == 0:
+             exit_reason = f"all_passed (iteration {iteration})"
+             if verbose:
+                 print(f"\nAll train queries passed on iteration {iteration}!", file=sys.stderr)
+             break
+
+         if iteration == max_iterations:
+             exit_reason = f"max_iterations ({max_iterations})"
+             if verbose:
+                 print(f"\nMax iterations reached ({max_iterations}).", file=sys.stderr)
+             break
+
+         # Improve the description based on train results
+         if verbose:
+             print(f"\nImproving description...", file=sys.stderr)
+
+         t0 = time.time()
+         # Strip test scores from history so improvement model can't see them
+         blinded_history = [
+             {k: v for k, v in h.items() if not k.startswith("test_")}
+             for h in history
+         ]
+         new_description = improve_description(
+             client=client,
+             skill_name=name,
+             skill_content=content,
+             current_description=current_description,
+             eval_results=train_results,
+             history=blinded_history,
+             model=model,
+             log_dir=log_dir,
+             iteration=iteration,
+         )
+         improve_elapsed = time.time() - t0
+
+         if verbose:
+             print(f"Proposed ({improve_elapsed:.1f}s): {new_description}", file=sys.stderr)
+
+         current_description = new_description
+
+     # Find the best iteration by TEST score (or train if no test set)
+     if test_set:
+         best = max(history, key=lambda h: h["test_passed"] or 0)
+         best_score = f"{best['test_passed']}/{best['test_total']}"
+     else:
+         best = max(history, key=lambda h: h["train_passed"])
+         best_score = f"{best['train_passed']}/{best['train_total']}"
+
+     if verbose:
+         print(f"\nExit reason: {exit_reason}", file=sys.stderr)
+         print(f"Best score: {best_score} (iteration {best['iteration']})", file=sys.stderr)
+
+     return {
+         "exit_reason": exit_reason,
+         "original_description": original_description,
+         "best_description": best["description"],
+         "best_score": best_score,
+         "best_train_score": f"{best['train_passed']}/{best['train_total']}",
+         "best_test_score": f"{best['test_passed']}/{best['test_total']}" if test_set else None,
+         "final_description": current_description,
+         "iterations_run": len(history),
+         "holdout": holdout,
+         "train_size": len(train_set),
+         "test_size": len(test_set),
+         "history": history,
+     }
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Run eval + improve loop")
+     parser.add_argument("--eval-set", required=True, help="Path to eval set JSON file")
+     parser.add_argument("--skill-path", required=True, help="Path to skill directory")
+     parser.add_argument("--description", default=None, help="Override starting description")
+     parser.add_argument("--num-workers", type=int, default=10, help="Number of parallel workers")
+     parser.add_argument("--timeout", type=int, default=30, help="Timeout per query in seconds")
+     parser.add_argument("--max-iterations", type=int, default=5, help="Max improvement iterations")
+     parser.add_argument("--runs-per-query", type=int, default=3, help="Number of runs per query")
+     parser.add_argument("--trigger-threshold", type=float, default=0.5, help="Trigger rate threshold")
+     parser.add_argument("--holdout", type=float, default=0.4, help="Fraction of eval set to hold out for testing (0 to disable)")
+     parser.add_argument("--model", required=True, help="Model for improvement")
+     parser.add_argument("--verbose", action="store_true", help="Print progress to stderr")
+     parser.add_argument("--report", default="auto", help="Generate HTML report at this path (default: 'auto' for temp file, 'none' to disable)")
+     parser.add_argument("--results-dir", default=None, help="Save all outputs (results.json, report.html, log.txt) to a timestamped subdirectory here")
+     args = parser.parse_args()
+
+     eval_set = json.loads(Path(args.eval_set).read_text())
+     skill_path = Path(args.skill_path)
+
+     if not (skill_path / "SKILL.md").exists():
+         print(f"Error: No SKILL.md found at {skill_path}", file=sys.stderr)
+         sys.exit(1)
+
+     name, _, _ = parse_skill_md(skill_path)
+
+     # Set up live report path
+     if args.report != "none":
+         if args.report == "auto":
+             timestamp = time.strftime("%Y%m%d_%H%M%S")
+             live_report_path = Path(tempfile.gettempdir()) / f"skill_description_report_{skill_path.name}_{timestamp}.html"
+         else:
+             live_report_path = Path(args.report)
+         # Open the report immediately so the user can watch
+         live_report_path.write_text("<html><body><h1>Starting optimization loop...</h1><meta http-equiv='refresh' content='5'></body></html>")
+         webbrowser.open(str(live_report_path))
+     else:
+         live_report_path = None
+
+     # Determine output directory (create before run_loop so logs can be written)
+     if args.results_dir:
+         timestamp = time.strftime("%Y-%m-%d_%H%M%S")
+         results_dir = Path(args.results_dir) / timestamp
+         results_dir.mkdir(parents=True, exist_ok=True)
+     else:
+         results_dir = None
+
+     log_dir = results_dir / "logs" if results_dir else None
+
+     output = run_loop(
+         eval_set=eval_set,
+         skill_path=skill_path,
+         description_override=args.description,
+         num_workers=args.num_workers,
+         timeout=args.timeout,
+         max_iterations=args.max_iterations,
+         runs_per_query=args.runs_per_query,
+         trigger_threshold=args.trigger_threshold,
+         holdout=args.holdout,
+         model=args.model,
+         verbose=args.verbose,
+         live_report_path=live_report_path,
+         log_dir=log_dir,
+     )
+
+     # Save JSON output
+     json_output = json.dumps(output, indent=2)
+     print(json_output)
+     if results_dir:
+         (results_dir / "results.json").write_text(json_output)
+
+     # Write final HTML report (without auto-refresh)
+     if live_report_path:
+         live_report_path.write_text(generate_html(output, auto_refresh=False, skill_name=name))
+         print(f"\nReport: {live_report_path}", file=sys.stderr)
+
+     if results_dir and live_report_path:
+         (results_dir / "report.html").write_text(generate_html(output, auto_refresh=False, skill_name=name))
+
+     if results_dir:
+         print(f"Results saved to: {results_dir}", file=sys.stderr)
+
+
+ if __name__ == "__main__":
+     main()
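The stratified holdout performed by `split_eval_set` can be tried in isolation. The sketch below is a standalone variation, not the packaged function: it assumes a local `random.Random(seed)` instead of reseeding the global generator, and the name `stratified_split` and the 20-item eval set are made up for illustration.

```python
import random


def stratified_split(eval_set, holdout, seed=42):
    """Split eval items into (train, test), stratified by should_trigger."""
    rng = random.Random(seed)  # local RNG: leaves the global random state untouched
    train, test = [], []
    for flag in (True, False):
        group = [e for e in eval_set if bool(e["should_trigger"]) is flag]
        rng.shuffle(group)
        # As in the script above, at least one item per group goes to the test set.
        n_test = max(1, int(len(group) * holdout))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical eval set: 10 queries that should trigger, 10 that should not.
demo_set = [{"query": f"q{i}", "should_trigger": i < 10} for i in range(20)]
train_set, test_set = stratified_split(demo_set, holdout=0.4)
print(len(train_set), len(test_set))
```

With a holdout of 0.4, each group of 10 contributes 4 items to the test set, so the test set keeps the same trigger/no-trigger balance as the full set.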
@@ -0,0 +1,47 @@
+ """Shared utilities for skill-creator scripts."""
+
+ from pathlib import Path
+
+
+
+ def parse_skill_md(skill_path: Path) -> tuple[str, str, str]:
+     """Parse a SKILL.md file, returning (name, description, full_content)."""
+     content = (skill_path / "SKILL.md").read_text()
+     lines = content.split("\n")
+
+     if lines[0].strip() != "---":
+         raise ValueError("SKILL.md missing frontmatter (no opening ---)")
+
+     end_idx = None
+     for i, line in enumerate(lines[1:], start=1):
+         if line.strip() == "---":
+             end_idx = i
+             break
+
+     if end_idx is None:
+         raise ValueError("SKILL.md missing frontmatter (no closing ---)")
+
+     name = ""
+     description = ""
+     frontmatter_lines = lines[1:end_idx]
+     i = 0
+     while i < len(frontmatter_lines):
+         line = frontmatter_lines[i]
+         if line.startswith("name:"):
+             name = line[len("name:"):].strip().strip('"').strip("'")
+         elif line.startswith("description:"):
+             value = line[len("description:"):].strip()
+             # Handle YAML multiline indicators (>, |, >-, |-)
+             if value in (">", "|", ">-", "|-"):
+                 continuation_lines: list[str] = []
+                 i += 1
+                 while i < len(frontmatter_lines) and (frontmatter_lines[i].startswith(" ") or frontmatter_lines[i].startswith("\t")):
+                     continuation_lines.append(frontmatter_lines[i].strip())
+                     i += 1
+                 description = " ".join(continuation_lines)
+                 continue
+             else:
+                 description = value.strip('"').strip("'")
+         i += 1
+
+     return name, description, content
@@ -0,0 +1,303 @@
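+ ---
+ name: compliance-judgment
+ description: Determine whether extracted entities comply with verification rules. Use after entity extraction to make the pass/fail judgment for each rule on each document. Covers translating natural language rules into executable logic, choosing between Python calculation and LLM semantic judgment, and producing actionable comments on failures. Also use when designing the judgment step of a workflow or when a rule's judgment logic needs debugging.
+ ---
+
+ # Compliance Judgment
+
+ Judgment is the decisive moment of the verification process. You have extracted the entity values and you have the rule requirements in hand. Now one question must be answered: **compliant or non-compliant?**
+
+ The answer must be clear and correct, and when the result is non-compliant it must come with a concise, actionable comment.
+
+ ## The Judgment Spectrum
+
+ Rules fall on a spectrum from fully deterministic to fully semantic:
+
+ ```
+ Deterministic (Python) ◄───────────────────────► Semantic (LLM)
+ threshold check  format validation  date calculation  cross-consistency  sufficiency  completeness  consistency  template compliance
+ ```
+
+ Judgments on the left are solved with code: free, instant, deterministic. Judgments on the right require language understanding: they need an LLM, cost money, and carry uncertainty. Most rules sit somewhere in the middle and need a hybrid approach.
+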
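+ ## Deterministic Judgment (Python)
+
+ When a rule has a clear, computable criterion, implement the judgment logic in Python.
+
+ ### Threshold Checks
+
+ The most common judgment type in financial regulation.
+
+ ```python
+ # Capital adequacy ratio >= 8% (CBIRC, 银保监会, requirement)
+ result = "pass" if extracted_ratio >= 8.0 else "fail"
+ comment = f"Capital adequacy ratio is {extracted_ratio}%, below the regulatory minimum of 8.0%" if result == "fail" else ""
+
+ # Non-performing loan ratio (usually monitored at < 5%, but the threshold varies by institution type)
+ result = "pass" if npl_ratio < threshold else "fail"
+
+ # Provision coverage ratio >= 150%
+ result = "pass" if provision_coverage >= 150.0 else "fail"
+
+ # Single-customer loan concentration <= 10%
+ result = "pass" if single_exposure <= 10.0 else "fail"
+ ```
+
+ Notes:
+ - Be explicit about boundary values: `>=` or `>`? In regulatory documents, "not lower than" (不低于) maps to `>=` and "lower than" (低于) maps to `<`.
+ - Floating-point precision: use `Decimal` or set a reasonable tolerance (e.g. 0.01%). Financial figures are usually precise to two decimal places.
+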
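The two notes on threshold checks can be combined into one small helper. This is a minimal sketch under stated assumptions: the function name `judge_threshold` and the `OPERATORS` table are hypothetical and not part of the package. Values are passed as strings so that `Decimal` sees the digits exactly as they appear in the document.

```python
import operator
from decimal import Decimal

# Map the wording of the regulation to an explicit comparison operator.
OPERATORS = {
    ">=": operator.ge,  # "not lower than"
    ">": operator.gt,   # "higher than"
    "<=": operator.le,  # "not higher than"
    "<": operator.lt,   # "lower than"
}


def judge_threshold(extracted: str, op: str, threshold: str) -> str:
    """Compare two percentages given as strings, without float rounding."""
    value = Decimal(extracted.rstrip("%"))
    limit = Decimal(threshold.rstrip("%"))
    return "pass" if OPERATORS[op](value, limit) else "fail"


print(judge_threshold("8.00%", ">=", "8.0%"))  # boundary value under >=
print(judge_threshold("8.00%", ">", "8.0%"))   # same value under a strict >
```

Making the operator an explicit argument forces the rule author to decide how the boundary value is treated instead of leaving it implicit in the code.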
+ ### Format Validation
+
+ ```python
+ import re
+
+ # Loan number format: XX-YYYY-ZZZZZZ (anchored so trailing characters are rejected)
+ result = "pass" if re.match(r"^[A-Z]{2}-\d{4}-\d{6}$", loan_number) else "fail"
+
+ # Unified Social Credit Code: 18 characters
+ result = "pass" if re.match(r"^[0-9A-Z]{18}$", uscc) else "fail"
+
+ # Mobile phone number format
+ result = "pass" if re.match(r"^1[3-9]\d{9}$", phone) else "fail"
+ ```
+
+ ### Date Calculations
+
+ ```python
+ from datetime import datetime, timedelta
+
+ # Contract signing date is within 30 days of the application date
+ sign_date = datetime.strptime(extracted_sign_date, "%Y-%m-%d")
+ app_date = datetime.strptime(extracted_app_date, "%Y-%m-%d")
+ result = "pass" if (sign_date - app_date).days <= 30 else "fail"
+ comment = f"Signing date {extracted_sign_date} is {(sign_date - app_date).days} days after application date {extracted_app_date}, exceeding the 30-day limit" if result == "fail" else ""
+
+ # Loan maturity date is not earlier than the contracted date
+ result = "pass" if actual_maturity >= contracted_maturity else "fail"
+
+ # Report issue date is within 4 months of the period end (annual report requirement)
+ report_date = datetime.strptime(extracted_report_date, "%Y-%m-%d")
+ period_end = datetime.strptime(extracted_period_end, "%Y-%m-%d")
+ deadline = period_end + timedelta(days=120)  # roughly 4 months
+ result = "pass" if report_date <= deadline else "fail"
+ ```
+
+ ### Cross-Consistency Checks
+
+ ```python
+ # The total equals the sum of the line items
+ result = "pass" if abs(total - sum(items)) < 0.01 else "fail"
+ comment = f"Total {total} does not match the sum of line items {sum(items)}; difference {total - sum(items)}" if result == "fail" else ""
+
+ # Balance sheet balances: assets = liabilities + owners' equity
+ result = "pass" if abs(assets - liabilities - equity) < 0.01 else "fail"
+
+ # The same metric has the same value in different sections
+ result = "pass" if value_in_summary == value_in_detail else "fail"
+ comment = f"Summary states {value_in_summary}, detail states {value_in_detail}" if result == "fail" else ""
+ ```
+
+ Deterministic judgment is the first choice. It is free, instant, and reproducible. Never call an LLM for a judgment that Python can make.
+
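+ ## Semantic Judgment (LLM)
+
+ Use an LLM when the rule requires language understanding.
+
+ ### Sufficiency Judgment
+
+ "Does the risk disclosure adequately describe the main risk factors?"
+
+ Python cannot judge this: "adequate" is a semantic concept that requires understanding the content.
+
+ Key points for designing the LLM judgment prompt:
+ 1. Provide the full text of the rule (what constitutes compliance).
+ 2. Provide the extracted document content (what the document actually says).
+ 3. Require structured output: pass/fail, reasoning, comment.
+ 4. Require conservative judgment: return fail only when the content is clearly non-compliant. Use uncertain for genuinely ambiguous cases.
+
+ ### Completeness Judgment
+
+ "Does the management discussion and analysis cover financial position, operating results, and cash flows?"
+
+ This is a checklist-style semantic judgment: does the content cover each of the required topics?
+
+ ```
+ Determine whether the following management discussion and analysis covers these three required topics:
+ 1. Analysis of financial position
+ 2. Analysis of operating results
+ 3. Analysis of cash flows
+
+ Document content:
+ {extracted_section}
+
+ For each topic, determine whether there is substantive discussion (not just a mention of the heading).
+ Return JSON:
+ {
+   "topic_1_covered": true/false,
+   "topic_2_covered": true/false,
+   "topic_3_covered": true/false,
+   "overall": "pass/fail",
+   "comment": "..."
+ }
+ ```
+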
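A reply to the completeness prompt still has to be validated before it becomes a judgment. The sketch below is one possible way to do that, not the package's own parser: the function name `parse_completeness` and the returned dictionary shape are assumptions for illustration. It recomputes the verdict from the three topic flags instead of trusting the model's own `overall` field.

```python
import json

TOPIC_KEYS = ("topic_1_covered", "topic_2_covered", "topic_3_covered")


def parse_completeness(raw: str) -> dict:
    """Validate the model's JSON reply and recompute the overall verdict."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"result": "error", "comment": "response is not valid JSON"}
    if not isinstance(data, dict) or not all(isinstance(data.get(k), bool) for k in TOPIC_KEYS):
        return {"result": "error", "comment": "missing or non-boolean topic flag"}
    uncovered = [k for k in TOPIC_KEYS if not data[k]]
    # Derive pass/fail from the flags rather than from the "overall" field.
    result = "fail" if uncovered else "pass"
    return {"result": result, "comment": data.get("comment", ""), "uncovered": uncovered}


# Hypothetical reply in which the model's "overall" contradicts its own flags.
reply = (
    '{"topic_1_covered": true, "topic_2_covered": true, "topic_3_covered": false,'
    ' "overall": "pass", "comment": "no cash flow analysis"}'
)
print(parse_completeness(reply))
```

Malformed replies map to `error` rather than `fail`, which keeps parsing problems separate from compliance problems.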
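+ ### Consistency Judgment
+
+ "Is the executive summary consistent with the detailed findings?"
+
+ This requires a semantic comparison of the two passages to check for contradictions or omissions.
+
+ ### Template Compliance Judgment
+
+ "Is the report written in the format set out in Annex 1 of the 《XX管理办法》 (XX Administrative Measures)?"
+
+ This requires an item-by-item comparison of the actual document structure against the template requirements.
+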
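+ ## Hybrid Judgment
+
+ In practice most rules need a hybrid approach. Run the cheap deterministic steps first and call the LLM only when necessary.
+
+ ### Example: Capital Adequacy Ratio Check
+
+ ```
+ Step 1 (regex extraction): extract the value associated with "capital adequacy ratio" → 12.5%
+ Step 2 (Python judgment): 12.5% >= 8.0% → pass
+ ```
+ If step 1 fails or has low confidence:
+ ```
+ Step 3 (LLM extraction): "Find the latest capital adequacy ratio in the following content" → 12.50%
+ Step 4 (Python judgment): 12.50% >= 8.0% → pass
+ ```
+ If the value is near the boundary (e.g. 8.02%):
+ ```
+ Step 5 (LLM review): "Confirm that 8.02% is the final adjusted capital adequacy ratio and not an intermediate figure"
+ ```
+
+ This funnel means that 90% of documents finish at step 2 (zero LLM cost) and only the hard cases reach the LLM.
+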
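The funnel for the capital adequacy check can be sketched as a single function. This is an illustration under stated assumptions, not the package's implementation: `judge_capital_adequacy`, `RATIO_PATTERN`, the `review_band` parameter, and the `llm_extract` callable (a stand-in for a real worker-LLM call) are all hypothetical names, and the regex targets English text only.

```python
import re
from typing import Callable, Optional

# Label, up to 20 non-digit characters, then a percentage.
RATIO_PATTERN = re.compile(
    r"capital adequacy ratio\D{0,20}?(\d+(?:\.\d+)?)\s*%", re.IGNORECASE
)


def judge_capital_adequacy(
    text: str,
    llm_extract: Callable[[str], Optional[float]],
    threshold: float = 8.0,
    review_band: float = 0.05,
) -> dict:
    """Regex first, LLM only as a fallback; flag near-threshold values for review."""
    match = RATIO_PATTERN.search(text)
    if match:
        value, source = float(match.group(1)), "regex"  # steps 1-2: no LLM cost
    else:
        value, source = llm_extract(text), "llm"        # steps 3-4: costly fallback
    if value is None:
        return {"result": "missing", "value": None, "source": source, "needs_review": False}
    result = "pass" if value >= threshold else "fail"
    needs_review = abs(value - threshold) <= review_band  # step 5 trigger
    return {"result": result, "value": value, "source": source, "needs_review": needs_review}


def fake_llm(text: str) -> Optional[float]:
    """Stand-in for a real LLM extraction call."""
    return 12.5


print(judge_capital_adequacy("The capital adequacy ratio was 12.5% at year end.", fake_llm))
print(judge_capital_adequacy("Capital adequacy ratio: 8.02%", fake_llm))
```

The LLM callable is injected as a parameter so that the deterministic path can be tested without any network access.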
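+ ## Output Format
+
+ The judgment result for each rule on each document:
+
+ ```json
+ {
+   "rule_id": "R001",
+   "document": "bank_annual_report_2024.pdf",
+   "result": "pass",
+   "extracted_value": "12.5%",
+   "expected": ">= 8.0%",
+   "comment": "",
+   "confidence": 0.95
+ }
+ ```
+
+ ### Values of `result`
+
+ | Value | Meaning | Comment requirement |
+ |---|------|---------|
+ | **pass** | Entity is compliant | Usually no comment needed |
+ | **fail** | Entity is non-compliant | Comment is **required** |
+ | **missing** | Entity not found in the document | State the search scope |
+ | **error** | Extraction or judgment failed | State the error type |
+ | **uncertain** | Judgment is ambiguous and needs human review | Explain the source of uncertainty |
+
+ **The distinction between missing and fail is critical**: missing is an extraction-level problem (the information is absent), whereas fail is a judgment-level problem (the information is present but non-compliant). Confusing the two leads to wrong statistics and wrong follow-up actions.
+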
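+ ## Comment Requirements
+
+ Comments are written for people. A reviewer should understand the problem at a glance.
+
+ ### Good Comments
+
+ ```
+ "Capital adequacy ratio is 7.2%, below the regulatory minimum of 8.0%."
+ "Loan contract signing date 2024-05-15 is 75 days after application date 2024-03-01, exceeding the 30-day limit."
+ "Balance sheet does not balance: total assets are 1,234,567 (in CNY 10,000), liabilities plus owners' equity are 1,234,590, a difference of 23."
+ "No dedicated discussion of liquidity risk was found in the risk management section."
+ ```
+
+ ### Bad Comments
+
+ ```
+ "Non-compliant." ← no specifics
+ "Capital adequacy ratio is below standard, posing a major risk." ← adds subjective judgment
+ "The bank's capital adequacy ratio is 7.2%; according to the CBIRC's 2023 … (a long-winded explanation)" ← too verbose
+ ```
+
+ ### Comment Principles
+
+ - **Concise and factual**: extracted value + expected value + difference, in three sentences or fewer.
+ - **Only on fail**: pass results need no comment unless the developer user explicitly asks for one.
+ - **No subjective judgment**: avoid words such as "severe", "major", or "worrying". State facts only.
+ - **Include the key figures**: the reviewer should understand the problem without going back to the source document.
+
+ ### Lightweight Annotation Format
+
+ To ease human review, save tokens, and allow diffs between verification rounds, judgment results can also be expressed in a compact text annotation format:
+
+ ```
+ [PASS] capital_adequacy <- 12.5% (>= 8.0%) | conf:0.95 | src:p3-s2
+ [FAIL] sign_date_gap <- 75d (<= 30d) | conf:0.90 | src:p1-s4 | note:signed 45 days past the limit
+ [MISSING] collateral_value | conf:0.60 | note:collateral valuation not found in the document
+ ```
+
+ This format converts losslessly to and from the JSON format above. Use it when showing results to the developer user for quick review, when logging evolution-iteration summaries to save tokens, and when computing diffs between verification rounds. See `references/output-format.md` for the full format specification and conversion rules.
+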
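One direction of the conversion, from a record to an annotation line, might look like the sketch below. It is inferred only from the three example lines and does not implement the specification in `references/output-format.md`; the function name `to_compact` and the record keys `rule` and `source` are assumptions, since the JSON example above carries `rule_id` and no source location.

```python
def to_compact(record: dict) -> str:
    """Render one judgment record as a compact annotation line."""
    head = f"[{record['result'].upper()}] {record['rule']}"
    if record.get("extracted_value") is not None:
        head += f" <- {record['extracted_value']} ({record['expected']})"
    parts = [head, f"conf:{record['confidence']:.2f}"]
    if record.get("source"):
        parts.append(f"src:{record['source']}")
    if record.get("comment"):
        parts.append(f"note:{record['comment']}")
    return " | ".join(parts)


# Hypothetical records mirroring the PASS and MISSING examples.
passed = {
    "rule": "capital_adequacy", "result": "pass", "extracted_value": "12.5%",
    "expected": ">= 8.0%", "confidence": 0.95, "source": "p3-s2", "comment": "",
}
missing = {
    "rule": "collateral_value", "result": "missing", "extracted_value": None,
    "confidence": 0.6, "comment": "collateral valuation not found in the document",
}
print(to_compact(passed))
print(to_compact(missing))
```

Because each record becomes exactly one line, two verification rounds can be compared with an ordinary line-based diff.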
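+ ## Judgment Order
+
+ Some rules depend on one another:
+
+ - **Conditional dependency**: rule B applies only when rule A passes. "If the borrower is a new customer (rule A), additional due diligence documents are required (rule B)."
+ - **Value dependency**: rule C uses a value computed by rule A. "The risk-weighted capital ratio (rule A) determines the required provisioning level (rule C)."
+ - **Logical dependency**: rule D needs to be checked only when rules A and B both fail.
+
+ Annotate these dependencies in the rule catalog. Execute rules in dependency order. Pass the results of upstream rules to downstream rules as context.
+
+ ### Dependency Graph Example
+
+ ```
+ R001 (capital adequacy ratio extraction) → R002 (capital adequacy ratio threshold judgment)
+ R003 (core tier 1 capital extraction) → R004 (core tier 1 capital adequacy ratio calculation) → R002
+ R001 + R005 (leverage ratio) → R006 (overall rating)
+ ```
+
+ If an upstream rule returns missing or error, mark the downstream rule as error or unable_to_judge as well, rather than forcing a judgment.
+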
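Dependency-ordered execution with failure propagation can be sketched with the standard library's `graphlib` (Python 3.9+). This is an illustration, not the package's scheduler: `DEPENDS_ON` encodes the example graph by hand, and `run_in_order` and `demo_judge` are hypothetical names.

```python
from graphlib import TopologicalSorter

# rule -> set of upstream rules it depends on (taken from the example graph)
DEPENDS_ON = {
    "R002": {"R001", "R004"},
    "R004": {"R003"},
    "R006": {"R001", "R005"},
}


def run_in_order(depends_on, judge):
    """Run rules upstream-first; skip a rule when any upstream is missing/error."""
    results = {}
    for rule in TopologicalSorter(depends_on).static_order():
        upstream = {dep: results[dep] for dep in depends_on.get(rule, ())}
        if any(r in ("missing", "error") for r in upstream.values()):
            results[rule] = "error"  # do not force a judgment
        else:
            results[rule] = judge(rule, upstream)  # upstream results passed as context
    return results


def demo_judge(rule, upstream):
    """Hypothetical judge: pretend the R003 extraction found nothing."""
    return "missing" if rule == "R003" else "pass"


results = run_in_order(DEPENDS_ON, demo_judge)
print(results)
```

In this run the missing value at R003 propagates to R004 and then to R002, while R006 is unaffected because neither of its upstream rules failed.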
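+ ## Edge Cases
+
+ ### Empty Extraction
+
+ The entity was not found. Default to **missing**, not fail. A missing value is an extraction-level problem, not a compliance-level one. Feed it back to the parsing and extraction steps; the parser may need upgrading or the extraction strategy adjusting.
+
+ ### Conflicting Values
+
+ The same entity appears in several places in the document with inconsistent values.
+
+ - Mark the result as **uncertain**.
+ - List every value found and its location in the comment.
+ - If the rule names a preferred source (e.g. "the figure in the audit report prevails"), use the value from that source.
+
+ ### Conditional Rules
+
+ "If the loan amount exceeds CNY 10 million, a guarantee must be provided."
+
+ - Check the condition first: does the loan amount exceed CNY 10 million?
+ - Condition not met → rule does not apply → result is pass (or not_applicable).
+ - Condition met → go on to check the requirement.
+
+ ### Negative Rules
+
+ "The document must not contain guarantee commitments to related parties."
+
+ Searching for absence is harder than searching for presence. Strategy:
+ - Search the document for the keywords ("related party" + "guarantee" + "commitment"; 关联方, 担保, 承诺 in Chinese-language documents).
+ - If a match is found, send the surrounding context to an LLM to confirm whether it constitutes an actual guarantee commitment (it may only be a statement that no guarantee was provided).
+ - If nothing matches, return pass with reduced confidence (the search may be incomplete).
+
+ ### Numeric Precision
+
+ Financial data often runs into precision problems:
+ - The report says "12.5%", but the exact value may be "12.4997%".
+ - Tiny differences caused by rounding should not be judged as fail.
+ - Set a reasonable tolerance in threshold comparisons. For "capital adequacy ratio >= 8%", a tolerance of 0.05% means that values from 7.95% up to the threshold are not failed outright but marked uncertain and escalated for human review.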
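The tolerance band for threshold comparisons can be written as a three-way judgment. This is a minimal sketch assuming a `>=` rule and string inputs; the function name `judge_with_tolerance` and its defaults (8.0 threshold, 0.05 tolerance) are taken from the example in the text, not from the package.

```python
from decimal import Decimal


def judge_with_tolerance(value: str, threshold: str = "8.0", tolerance: str = "0.05") -> str:
    """Judge a '>= threshold' rule with an uncertain band just below the threshold."""
    v, t, tol = Decimal(value), Decimal(threshold), Decimal(tolerance)
    if v >= t:
        return "pass"
    if v >= t - tol:
        return "uncertain"  # within rounding distance: escalate to a human
    return "fail"


for sample in ("8.0", "7.97", "7.95", "7.94"):
    print(sample, judge_with_tolerance(sample))
```

Using `Decimal` for the band edges avoids the case where a float representation of 7.95 falls on the wrong side of the cutoff.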