kc-beta 0.8.1 → 0.8.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/src/agent/context.js +17 -1
- package/src/agent/engine.js +85 -8
- package/src/agent/llm-client.js +24 -1
- package/src/agent/pipelines/_milestone-derive.js +78 -7
- package/src/agent/pipelines/skill-authoring.js +19 -2
- package/src/agent/tools/release.js +94 -1
- package/src/cli/index.js +28 -7
- package/template/.env.template +1 -1
- package/template/AGENT.md +2 -2
- package/template/skills/en/auto-model-selection/SKILL.md +55 -35
- package/template/skills/en/bootstrap-workspace/SKILL.md +13 -0
- package/template/skills/en/compliance-judgment/SKILL.md +14 -0
- package/template/skills/en/confidence-system/SKILL.md +30 -8
- package/template/skills/en/corner-case-management/SKILL.md +53 -33
- package/template/skills/en/cross-document-verification/SKILL.md +88 -83
- package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
- package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
- package/template/skills/en/data-sensibility/SKILL.md +19 -12
- package/template/skills/en/document-chunking/SKILL.md +99 -15
- package/template/skills/en/entity-extraction/SKILL.md +14 -4
- package/template/skills/en/quality-control/SKILL.md +14 -0
- package/template/skills/en/rule-extraction/SKILL.md +92 -94
- package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
- package/template/skills/en/skill-authoring/SKILL.md +52 -8
- package/template/skills/en/skill-creator/SKILL.md +25 -3
- package/template/skills/en/skill-to-workflow/SKILL.md +23 -4
- package/template/skills/en/task-decomposition/SKILL.md +1 -1
- package/template/skills/en/tree-processing/SKILL.md +1 -1
- package/template/skills/en/version-control/SKILL.md +15 -0
- package/template/skills/en/work-decomposition/SKILL.md +21 -35
- package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
- package/template/skills/zh/bootstrap-workspace/SKILL.md +13 -0
- package/template/skills/zh/compliance-judgment/SKILL.md +14 -0
- package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
- package/template/skills/zh/confidence-system/SKILL.md +34 -9
- package/template/skills/zh/corner-case-management/SKILL.md +71 -104
- package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
- package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
- package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
- package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
- package/template/skills/zh/data-sensibility/SKILL.md +13 -0
- package/template/skills/zh/document-chunking/SKILL.md +96 -20
- package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
- package/template/skills/zh/entity-extraction/SKILL.md +14 -4
- package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
- package/template/skills/zh/quality-control/SKILL.md +14 -0
- package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
- package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
- package/template/skills/zh/rule-extraction/SKILL.md +199 -188
- package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
- package/template/skills/zh/skill-authoring/SKILL.md +108 -69
- package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
- package/template/skills/zh/skill-creator/SKILL.md +71 -61
- package/template/skills/zh/skill-creator/references/schemas.md +60 -60
- package/template/skills/zh/skill-to-workflow/SKILL.md +24 -5
- package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
- package/template/skills/zh/task-decomposition/SKILL.md +1 -1
- package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
- package/template/skills/zh/tree-processing/SKILL.md +1 -1
- package/template/skills/zh/version-control/SKILL.md +15 -0
- package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
- package/template/skills/zh/work-decomposition/SKILL.md +21 -33
@@ -16,6 +16,12 @@ Data/entity extraction (`entity-extraction`) is the **repeating task** that runs

Don't conflate the two. Rule extraction happens once; data extraction happens on every document.

+## Source-first sequencing
+
+Extract rules from the source text FIRST. Only after you have a complete first-pass catalog from sources alone should you open sample documents. The temptation is to peek at samples early to "see what kinds of rules matter" — this biases you toward rules the samples happen to exercise and silently drops rules the samples don't cover.
+
+A domain professional reads the source material, builds an understanding, then validates on samples — not the reverse. KC's differentiator over general-purpose agents is systematic accuracy across long context; that advantage compounds when you ground in the SOURCE not the EXAMPLES.
+
## Rule Structure: Location → Extraction → Judgment

Every verification rule decomposes into three parts:

@@ -62,22 +68,17 @@ When rules change (additions, modifications, deprecations), version the entire r

## Granularity Calibration (read before extracting)

-If your first pass
-- **Drop procedural language** that isn't checkable against a report
-(definitions, scope statements, references to other regs that just
-transitively apply).
-- **Keep only checkable obligations, prohibitions, and thresholds** —
-things where you can read a sample report and say pass or fail.
+Rule catalogs come from diverse source materials — formal regulations, internal handbooks, case law, legal opinions, expert rule tables, regulator Q&A. There is no universal "right number of rules per page". Calibrate by logic, not by count:
+
+- **Atomicity is the real test.** A rule that can produce two independent pass/fail outcomes is two rules. A rule whose verdict requires verifying three different paragraphs of the source is probably three rules.
+- **Boilerplate is not a rule.** Definitions, scope statements, transitive references to other regulations, and procedural language that can't be checked against the target document do not become rules.
+- **Keep only checkable obligations, prohibitions, and thresholds** — things where you can read a target document and say pass / fail / not-applicable.
+
+If your first pass feels too coarse (one rule per chapter, ignoring multiple distinct obligations within) — go finer. If it feels too fine (every clause in a definitions section is its own rule) — merge or drop. Then:
+
+- **Merge rules that share evidence and fail together** (e.g., "must disclose X" and "must disclose Y" where both come from the same required-fields table → one rule: "must disclose the required-fields list including X, Y").
+- **Drop procedural language** that isn't checkable against a target document.
+- **Convert each surviving rule into a falsifiability statement** — if you can't state precisely what would make it fail, you don't have a rule yet.

### Sample "good" rule

@@ -94,104 +95,58 @@ If your first pass produces more than ~25 rules for a single regulation:
}
```

-Note: one pass/fail outcome, a single `source_ref` to a specific clause,
-clear applicability scope. Skill-authoring can write `check_r014.py` from
-this alone.
+Note: one pass/fail outcome, a single `source_ref` to a specific clause, clear applicability scope. Skill-authoring can write `check_r014.py` from this alone.

-### Cross-
+### Cross-source dedup (when working across multiple documents)

-If the developer user provides N
-duplicate cross-cutting requirements already captured by earlier ones
-(e.g., a 2018 generic disclosure rule vs. a 2025 specific version).
-Before emitting a rule from reg N:
+If the developer user provides N source documents, rules from later sources often duplicate cross-cutting requirements already captured by earlier ones (e.g., a generic disclosure rule from an older regulation vs. a newer specific version of the same obligation). Before emitting a rule from source N:

-1. **Check the existing catalog.** Use `rule_catalog` (operation: list)
-to see what's already there. Skip if a rule with equivalent scope +
-intent exists.
+1. **Check the existing catalog.** Use `rule_catalog` (operation: list) to see what's already there. Skip if a rule with equivalent scope + intent exists.
2. **Prefer the newer / more specific source_ref** when rules overlap.
-3. **If you merged rules**, record the consolidated sources in
-`source_ref`: e.g., `"New Reg §15.2 + Old Reg §24"`.
+3. **If you merged rules**, record the consolidated sources in `source_ref`: e.g., `"New Reg §15.2 + Old Reg §24"`.

### Delegation to sub-agents

-If you dispatch extraction to sub-agents (one per
-"the core regs" as a pronoun — LLMs composing long structured briefs
-frequently drop items (observed in session 6304673afaa0 where reg 02
-was silently omitted).
-- **State the dedup contract**: "Rules already in the parent's catalog
-(R001–Rnnn) should NOT be re-extracted. If a requirement is already
-covered, skip it." Then pass the current catalog's ID ranges.
-- **Prefer `rule_catalog` create operations over sandbox_exec writes to
-catalog.json.** rule_catalog uses workspace file locking;
-sandbox_exec bypasses it and races with other writers.
-
-## How to read regulation files (default: read whole)
-
-Regulations are the audit's authoritative basis. Every `source_ref`
-in your extracted rules must be verifiable against the source text.
-For typical regulation documents (a single file under ~50 KB / under
-~100 pages), **read each regulation file whole using `workspace_file`
-(operation=read) in a single call**:
+If you dispatch extraction to sub-agents (one per source document), the sub-agent inherits ONLY its `task_description` — it cannot see your conversation or existing catalog. Therefore, when composing the brief:
+
+- **Anchor calibration with a concrete sample rule.** Paste the JSON above verbatim into the brief body so the sub-agent's atomicity calibration matches yours.
+- **Name every source document the sub-agent should process.** If AGENT.md lists 10 core source documents, the brief must list all 10 by name, not "the core regs" as a pronoun — LLMs composing long structured briefs frequently drop items silently.
+- **State the dedup contract**: "Rules already in the parent's catalog (R001–Rnnn) should NOT be re-extracted. If a requirement is already covered, skip it." Then pass the current catalog's ID ranges.
+- **Prefer `rule_catalog` create operations over sandbox_exec writes to catalog.json.** rule_catalog uses workspace file locking; sandbox_exec bypasses it and races with other writers.
+
+## How to read source files (default: read whole)
+
+Source documents are the catalog's authoritative basis. Every `source_ref` in your extracted rules must be verifiable against the source text. For typical source documents (a single file under ~50 KB / under ~100 pages), **read each source file whole using `workspace_file` (operation=read) in a single call**:

```js
-workspace_file({ operation: "read", scope: "project", path: "Rules/
+workspace_file({ operation: "read", scope: "project", path: "Rules/01_some_source.md" })
```

-`workspace_file.read` is capped at 50,000 chars per call, which
-covers virtually every individual regulation document. This is the
-default. **Read every regulation file whole before you start
-extracting rules from any of them.**
+`workspace_file.read` is capped at 50,000 chars per call, which covers virtually every individual source document. This is the default. **Read every source file whole before you start extracting rules from any of them.**

### Tool choice — `workspace_file` vs `sandbox_exec`

| Tool | Per-call cap | Use for |
|---|---:|---|
-| `workspace_file` (read) | 50,000 chars | **full reads of
+| `workspace_file` (read) | 50,000 chars | **full reads of source / rule documents** |
| `sandbox_exec` (cat/head/etc) | 10,000 chars | shell commands, **not** full file reads |

-`sandbox_exec` is designed for shell commands; its 10K cap is too
-small for most regulations. `cat rules/01_*.md` returns only the
-first ~10 KB followed by `\n[truncated]`. Re-issuing with `head -N` /
-`tail -M` to scroll the window loses positional precision and burns
-turns. **When you see truncation, don't fight the cap — switch
-tools.**
+`sandbox_exec` is designed for shell commands; its 10K cap is too small for most regulations. `cat rules/01_*.md` returns only the first ~10 KB followed by `\n[truncated]`. Re-issuing with `head -N` / `tail -M` to scroll the window loses positional precision and burns turns. **When you see truncation, don't fight the cap — switch tools.**

-### Asymmetry —
+### Asymmetry — sources read whole, samples sampled

-read once. Read every regulation whole.
+Source documents are limited (typically 1-10 files), authoritative, and read once. Read every source file whole.

-Sample documents may number 30 to 1000+, are heterogeneous, and get
-read many times during testing. **Don't try to read every sample
-whole.** Use rule-applicability filters or sampled subsets to focus
-attention.
+Sample documents may number 30 to 1000+, are heterogeneous, and get read many times during testing. **Don't try to read every sample whole.** Use rule-applicability filters or sampled subsets to focus attention.

-### Escape valve — when a single
+### Escape valve — when a single source exceeds ~200K chars

-Rare in practice
-typical Chinese banking regs (资管新规, 信披办法, etc.) all fit
-under 50 KB. But if you do encounter a single regulation so large
-that reading it whole would crowd the context window — heuristic:
-the file exceeds ~200,000 chars or ~25% of your context budget —
-use your own judgment:
+Rare in practice — most regulation, handbook, or rule-table documents fit comfortably under 50 KB. But if you do encounter a single source document so large that reading it whole would crowd the context window — heuristic: the file exceeds ~200,000 chars or ~25% of your context budget — use your own judgment:

-- Read by chapter (e.g., `第X章` / `Chapter X`) using `document_parse`
-- Or build an in-workspace index file pointing to chapter offsets and
-read on-demand per rule being extracted
+- Read by chapter (e.g., `第X章` / `Chapter X`) using `document_parse` or paginated `workspace_file` reads
+- Or build an in-workspace index file pointing to chapter offsets and read on-demand per rule being extracted

-The 50 KB cap is high enough that this almost never triggers. **The
-default is read whole; deviate only when the file genuinely doesn't
-fit.**
+The 50 KB cap is high enough that this almost never triggers. **The default is read whole; deviate only when the file genuinely doesn't fit.**
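As a concrete illustration of the chapter-offset idea mentioned in the bullets above, a minimal Python sketch (this is not a KC-shipped script; the chapter regexes mirror the patterns named in the skill, everything else is an assumption):

```python
import re

# Match Chinese "第X章" and English "Chapter N" headings at line start.
CHAPTER_RE = re.compile(r"^(第[一二三四五六七八九十百]+章|Chapter \d+).*$", re.MULTILINE)

def build_chapter_index(text: str) -> list[dict]:
    """Return [{'title', 'start', 'end'}] character offsets so an oversized
    source can be read one chapter at a time instead of whole."""
    marks = [(m.start(), m.group(0).strip()) for m in CHAPTER_RE.finditer(text)]
    index = []
    for i, (start, title) in enumerate(marks):
        end = marks[i + 1][0] if i + 1 < len(marks) else len(text)
        index.append({"title": title, "start": start, "end": end})
    return index

# Usage: chapter_text = text[index[3]["start"]:index[3]["end"]]
```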

## Extraction Strategies

@@ -202,11 +157,14 @@ When the developer user provides rules in xlsx, csv, or a structured document wh
- Map each row to a rule, preserving the developer user's identifiers.
- Ask clarifying questions only if entries are ambiguous.

-### Strategy 2: Hierarchical Extraction from
+### Strategy 2: Hierarchical Extraction from Source Text

-For raw
+For raw source documents (PDF, DOCX, legal text, handbooks, case collections):

1. **Survey the document structure.** Read the table of contents or scan headers. Understand the hierarchy: parts, chapters, sections, articles, clauses.
+
+Before extracting any rule, traverse the table of contents and section headers end-to-end. Sketch the rule-bearing hierarchy: which chapters impose obligations, which are definitions / context. A common failure mode: a long source with many articles yields disproportionately few rules — almost always meaning you stopped surveying after the high-density chapters. Decide your rule-bearing chapter span explicitly, then justify deviations relative to that span rather than to a single global count target.
+
2. **Identify rule-bearing sections.** Not every section contains a verification rule. Some are definitions, some are procedural, some are context. Focus on sections that impose obligations, prohibitions, thresholds, or requirements.
3. **Peel the onion.** Start at the highest structural level and work downward:
- Level 1: What major areas does the regulation cover? (e.g., capital adequacy, risk disclosure, governance)
@@ -216,7 +174,7 @@ For raw regulation documents (PDF, DOCX, legal text):
4. **Handle cross-references.** Regulations love to say "as defined in Section X" or "subject to the conditions in Article Y." Resolve these by including the referenced content in the rule's description, not just the reference.
5. **Handle compound rules.** "The report must include (a) risk factors, (b) financial projections, and (c) management discussion" — this is three rules, not one. Decompose unless the developer user specifically wants them grouped.
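To make the decomposition in item 5 concrete, a small sketch of one compound requirement becoming three catalog entries (rule IDs and `source_ref` values are invented for illustration; field names follow the sample rule shown earlier):

```python
compound = ("The report must include (a) risk factors, (b) financial "
            "projections, and (c) management discussion")

# One atomic rule per checkable obligation, each with its own pass/fail outcome.
rules = [
    {"id": "R021", "description": "Report includes a risk factors section",
     "source_ref": "Reg §12.1(a)"},
    {"id": "R022", "description": "Report includes financial projections",
     "source_ref": "Reg §12.1(b)"},
    {"id": "R023", "description": "Report includes a management discussion section",
     "source_ref": "Reg §12.1(c)"},
]
```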

-For long documents
+For long documents, use the onion-peeler approach — see the `document-chunking` skill for the full strategy and the wedge-driving fallback for sections without clear headers. Do not try to read the entire document in one pass.

### Strategy 3: Expert Notes

@@ -285,6 +243,8 @@ Do not skip ambiguous rules. They are often the most important ones.

## Sanity-check applicability against the sample corpus

+> This is a validation pass, not a discovery pass. Do not let 0-sample rules tempt you to delete them at this stage — first ask whether the source requires them; if yes, keep them as "future scope" rather than drop.
+
After extracting your rule catalog and before authoring skills, do this 5-minute check: project each rule's applicability filter against the sample corpus.

For every rule:
@@ -292,14 +252,52 @@ For every rule:
2. For each rule, count how many samples it would apply to (per the rule's `applicability` field, scope filter, or whatever shape your catalog uses)
3. Flag rules that apply to **0 samples** — they're either genuinely test-corpus-irrelevant (acceptable) or over-constrained (bug)

+A failure mode worth flagging: a catalog where a large fraction of rules (say 30-40%) return `PASS=0 FAIL=0 NOT_APPLICABLE=all` across the entire sample set. Some inactive rules are legitimate (the source requires checks for a product type the corpus doesn't happen to contain), but a high inactive ratio almost always signals scope-too-narrow drift — applicability filters that over-specify.

If many rules are 0-sample, either:
- **Reframe their applicability** — broaden product types, look for evidence in headers/footers not just body, relax the scope filter
- **Document them as "future scope"** and remove from this iteration's catalog (still capture them in a `rules/future_scope.md` so they're not forgotten)
- **Update the test corpus** to include matching samples (work with the developer user)

-Catching this in `rule_extraction` is much cheaper than authoring
+Catching this in `rule_extraction` is much cheaper than authoring N skills that then test as inactive in `skill_testing`. The cheap projection here is worth the time it saves later.
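A minimal Python sketch of that projection, assuming each catalog entry has an `id` and that you supply an `applies(rule, sample)` predicate matching whatever shape your applicability filters take (both are assumptions, not part of the package):

```python
def project_applicability(rules, samples, applies):
    """Count, per rule, how many samples its applicability filter matches."""
    counts = {r["id"]: sum(1 for s in samples if applies(r, s)) for r in rules}
    zero_sample = [rid for rid, n in counts.items() if n == 0]
    inactive_ratio = len(zero_sample) / max(len(rules), 1)
    return counts, zero_sample, inactive_ratio

# A high inactive_ratio (say above ~0.3) usually means over-constrained
# applicability filters rather than a genuinely narrow sample corpus.
```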
+
+## Logic-type taxonomy (coverage diagnostic)
+
+After first-pass extraction, classify each rule by judgment type:
+
+- **Threshold** — numeric comparison ("annualized rate ≥ 15.4%")
+- **Decision-Tree** — multi-branch ("if product type ∈ {A, B} then ...")
+- **Heuristic** — semantic judgment ("does marketing copy imply principal guarantee")
+- **Process** — procedural compliance ("published within the required deadline")
+
+If your catalog is 90% Threshold rules, you have likely missed the semantic / process obligations that don't reduce to a number. Re-survey for those. The four types are roughly comparable in frequency across most rule corpora; a heavy skew is a signal to look again at the chapters or sections you skimmed.
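A quick way to surface that skew, assuming each rule carries a `logic_type` field holding one of the four labels above (the field name is an assumption for illustration):

```python
from collections import Counter

def logic_type_skew(rules):
    """Tally rules by judgment type and report the heaviest share."""
    if not rules:
        return Counter(), None, 0.0
    tally = Counter(r.get("logic_type", "Unclassified") for r in rules)
    top_type, top_count = tally.most_common(1)[0]
    return tally, top_type, top_count / len(rules)

# A top share near 0.9 suggests the semantic / process obligations were missed.
```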
+
+## Preserve specifics (anti-summarize)
+
+When writing a rule's `description` and `falsifiability_statement`, preserve every threshold, percentage, deadline, and named entity from the source. "Disclose within a reasonable period" is a vague rule and will fail downstream — the source almost certainly says "within 15 business days." If the source IS genuinely vague, flag the ambiguity explicitly (e.g., `notes: "source uses '及时'; no numeric deadline"`) rather than smoothing it. Downstream skill-authoring will need the specifics to write check.py logic.
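A before/after sketch of the same obligation, first written too vaguely and then with the source's specifics preserved (rule ID, reference, and wording are invented for illustration):

```python
# Too vague to check: downstream check.py has nothing to compare against.
vague = {"id": "R030",
         "description": "Disclose fee changes within a reasonable period"}

# Specifics preserved; remaining ambiguity flagged instead of smoothed over.
specific = {
    "id": "R030",
    "description": "Disclose fee changes within 15 business days of the change taking effect",
    "source_ref": "Reg §8.3",
    "notes": "source uses '及时' for the notification channel; no numeric deadline there",
}
```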
+
+## Soft sample-access discipline
+
+You have unlimited tool access to samples — KC does not cap you. The discipline is procedural: source-extraction phase first, then validation phase. Inside source-extraction phase, samples are a last-resort reference for clarifying terminology, not a discovery surface. If you find yourself opening sample N° 3 to figure out what to extract next, you have inverted the methodology — close the sample, return to the source. Acceptable narrow exceptions:
+- A jargon term in the source needs example resolution
+- Sanity-check that a rule's `description` field reads coherently when applied to a real document
+
+## Primary vs auxiliary sources — iteration order, NOT coverage breadth
+
+When the developer user labels some source documents "primary" and others "auxiliary" (or "supplementary", or "secondary"), that distinction is about **iteration order**: do the primary regs deeply first, then come back to the auxiliary ones. It is **NOT** a license to skip the auxiliary regs entirely.
+
+A recurring failure mode worth flagging: agent reads "primary 01-02 are the main basis, the rest is auxiliary" and produces 13 rules from regs 01-02 + 2 rules from regs 03-04 + zero rules from regs 05-10. The auxiliary regulations (often 60-90 articles each in compliance domains) almost always contain core obligations the primary regs reference or assume. Extracting nothing from them produces a thin catalog that misses real compliance requirements.
+
+The right interpretation: primary regs get the first deep pass, the auxiliary regs get a structural-survey pass at minimum — identify their core obligations and extract those, even if not at the same density as primary. Skipping an 80-article regulation entirely should require an explicit reason in `coverage_audit.md` (e.g., "regulation 05 covers fund operations outside our case scope; explicitly out-of-scope per user discussion"). Silent skipping is the failure mode.
+
+## Coverage trace (recommended deliverable)
+
+After extraction, walk the source document paragraph-by-paragraph and tag each as either:
+
+- `covered_by: [Rxxx, Ryyy]` — articles whose obligations became one or more rules
+- `non_checkable: definition | context | cross_ref | scope` — articles excluded with explicit reason
+
+Write this as `rules/coverage_trace.md` (or a section in `coverage_audit.md`). This is the source-side mirror of the existing sample-side applicability check, and catches the "long source → suspiciously few rules" failure mode directly. Engine derivation can read this trace to validate completeness later.
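A minimal sketch of emitting that trace file; the two entry shapes mirror the bullets above, while the file path is the one named in the skill and the dict keys are assumptions:

```python
def write_coverage_trace(entries, path="rules/coverage_trace.md"):
    """entries: [{'article': '§5.2', 'covered_by': ['R014']},
                 {'article': '§1.1', 'non_checkable': 'definition'}, ...]"""
    lines = ["# Coverage trace", ""]
    for e in entries:
        if e.get("covered_by"):
            lines.append(f"- {e['article']} covered_by: {', '.join(e['covered_by'])}")
        else:
            lines.append(f"- {e['article']} non_checkable: {e['non_checkable']}")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```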

## When Rules Change

@@ -1,80 +1,9 @@
-# Chunking Strategies
+# Chunking Strategies

+The chunking methodology (onion-peeler + wedge fallback + balance
+heuristics) now lives in the `document-chunking` skill. Consult that
+skill directly when designing chunking for rule extraction or any
+downstream processing.

-Hierarchical header-based decomposition. Named because you peel the document layer by layer, from the outermost structure inward.
-
-### How It Works
-
-1. **Parse the document's header hierarchy.** Identify all headers by level (H1, H2, H3, etc. — or their equivalents in the document's formatting: "Part I", "Chapter 1", "Section 1.1", "Article 1").
-2. **Build a tree.** Each header becomes a node. Content between headers belongs to the nearest preceding header at that level.
-3. **Check sizes.** Walk the tree. If a node's content (including all its children) fits within your processing limit, stop — this node is a chunk.
-4. **Split only when necessary.** If a node exceeds the limit, descend to its children. Only split when a node is too large AND has sub-headers to split on.
-5. **Leaf nodes that are still too large** get handled by the wedge-driving fallback (see below).
-
-### Why This Works
-
-- Respects the document's own semantic structure. A "Chapter 3: Risk Disclosure" chunk contains exactly what the author intended that chapter to contain.
-- Minimizes information loss. You never cut in the middle of a thought.
-- Produces chunks of varying size — and that is fine. A short chapter is better as one chunk than split into artificial halves.
-
-### Pattern Discovery Shortcut
-
-Before building a full parser, explore several sample documents for structural patterns:
-- Do all chapter titles start with "Chapter X" or "第X章"?
-- Are sections numbered consistently (1.1, 1.2, 1.3)?
-- Are there visual markers (bold text, specific fonts, horizontal rules)?
-
-If you find consistent patterns, a regex-based splitter is faster and more reliable than LLM-based structure detection. For example:
-- `^第[一二三四五六七八九十百]+章` for Chinese chapter headers
-- `^Chapter \d+` for English chapter headers
-- `^\d+\.\d+` for numbered sections
-
-Always validate the regex against multiple documents before committing to it.
-
-## Wedge Driving (Fallback Strategy)
-
-For content without clear headers — dense legal text, continuous prose, or leaf nodes from the onion peeler that are still too large.
-
-### How It Works
-
-The algorithm uses a **rolling context window** to process documents of arbitrary length without loading the full text at once.
-
-**Step 1: Window the content.** Load up to MAX_TOKENS (e.g., 100K tokens — configurable) of the remaining unprocessed text into a window. If the remaining text fits in a single chunk, stop — no further splitting needed.
-
-**Step 2: Ask an LLM for cut points.** Prompt the LLM to identify 1-3 natural break points within the window where topic or subject changes. For each cut point, the LLM returns:
-- `tokens_before`: ~K tokens (default K=50) immediately BEFORE the cut, copied verbatim from the text.
-- `tokens_after`: ~K tokens immediately AFTER the cut, copied verbatim.
-- `chunk_title`: a 5-10 word title describing the chunk that precedes the cut.
-
-Using token count (not word count) gives consistent granularity across languages — critical for Chinese text which has no whitespace-delimited words.
-
-**Step 3: Locate the cuts via fuzzy matching.** The LLM's quoted tokens will not be a perfect match to the source text (minor paraphrasing, whitespace differences, encoding artifacts). Use Levenshtein distance (edit distance) to find the best match:
-1. Search the source text for the position that best matches `tokens_before`. Require at least 70% similarity (similarity = 1 - edit_distance / max_length).
-2. The cut position is immediately after the matched `tokens_before` region.
-3. Verify by checking that `tokens_after` appears near the cut position. If `tokens_after` cannot be matched, fall back to the position derived from `tokens_before` alone.
-
-**Step 4: Slide and repeat.** Create a chunk from the text before the first confirmed cut. Move the window forward: the new window starts from the last cut point. Repeat until all remaining text fits in a single chunk.
-
-### Why This Works
-
-- The LLM identifies semantic boundaries, not arbitrary character counts.
-- The LLM never regenerates text — it only quotes positions. No hallucination risk.
-- K-token quoting with Levenshtein matching is language-agnostic. It works for Chinese, English, and mixed-language documents equally well.
-- The rolling window means documents of any length can be processed incrementally — the algorithm is not bounded by context window size.
-- Fuzzy matching handles the inevitable small differences between the LLM's quoted text and the actual source.
-
-### When to Use
-
-- Only when the onion peeler cannot split further (no sub-headers available).
-- For documents with no structural markup at all.
-- Cost consideration: this requires LLM calls. Use the cheapest model that can identify topic boundaries (often TIER3 or TIER4 is sufficient).
-
-## Practical Guidelines
-
-- **Chunk size depends on the downstream task.** For rule extraction by the coding agent, chunks can be large (100K+ tokens). For worker LLM processing, chunks must fit in 16K-32K context.
-- **Preserve context.** When splitting, include the parent header chain as context. A chunk from "Part II > Chapter 3 > Section 3.2" should include those headers so downstream processing knows where the content belongs.
-- **Cache the tree.** Once a document's structure is parsed, save the tree. Multiple rules may need content from the same document, and re-parsing is wasteful.
-- **Log your chunking decisions.** Which strategy was used, how many chunks were produced, their sizes. This helps debug downstream issues.
+This stub remains for legacy references; new content goes to
+`document-chunking`.
@@ -43,9 +43,9 @@ When grouping, name the file with the explicit range so downstream consumers (wo

### Anti-pattern: the unified runner

-If you find yourself writing a single `unified_qc.py` (or `batch_runner.py`, or `master_check.py`) that handles all
+If you find yourself writing a single `unified_qc.py` (or `batch_runner.py`, or `master_check.py`) that handles all rules in one Python file, **stop**. That means your per-rule skills are wrong, not that the architecture is wrong. Fix the skills.

+A failure pattern worth flagging: an agent writes a unified runner like `unified_qc.py` to bypass individual skills it doesn't trust. The result is cascading errors — a single rule's failure corrupts every other rule's verdict, easily producing 15%+ error rates across thousands of production checks. The unified runner feels productive locally and is a global mistake. It also stalls the phase model: with no individual `check.py` files landing on disk, the engine can't credit the work toward milestone completion.

If individual skills aren't running cleanly, the right response is to identify which ones break and fix them, not consolidate. The whole pipeline (extraction → skill_testing → distillation → production_qc) assumes one rule = one verifiable artifact.

@@ -53,20 +53,22 @@ If individual skills aren't running cleanly, the right response is to identify w

Each rule_skill folder MUST have BOTH a substantive `SKILL.md` AND a substantive `check.py` (or `check.py` that imports + calls a workflow that does the real work). One side being a stub breaks the contract.

-**Variant 1
+**Variant 1**: stub `SKILL.md` (a templated ~20-line scaffold with `检查逻辑: N/A` or equivalent) paired with a real `check.py` (regex methodology embedded in code). SKILL.md is supposed to be the human-readable methodology document. A reader scanning the rule folder for "what does this verify and why" gets nothing. The methodology has been pushed entirely into `check.py` comments — works for the engine, loses the deliverable framing.

-**Variant 2
+**Variant 2**: substantive `SKILL.md` (real methodology, PASS/FAIL criteria, source cross-refs) paired with stub `check.py` (a thin scaffold returning `{"verdict": "NOT_APPLICABLE", "evidence": "Check requires worker LLM execution"}`). The real check logic lives in `workflows/<rule_id>/workflow.py` — but `check.py` doesn't import or call it. A user running `python rule_skills/R01-01/check.py document.txt` gets `NOT_APPLICABLE` on every input, which is misleading.

-**Variant 3 (legacy
+**Variant 3 (legacy)**: stub `check.py` returning `{"pass": null, "method": "stub"}` paired with an otherwise-real SKILL.md. Methodology described but never executable.
+
+**Variant 4 (the "monolithic verify engine" stub)**: per-rule SKILL.md is a thin 20-35 line scaffold ("see verify_engine.py"), per-rule check.py is a thin shim that imports + calls a single monolithic `rule_skills/verify_engine.py` (or similar root-level file) that holds all 15-20 rules' verification logic in one ~750-LOC file. Each per-rule check.py looks like `from rule_skills.verify_engine import check_R01_01; return check_R01_01(doc)`. This passes the "check.py is not literally a stub" surface check but inverts the canonical per-rule granularity: per-rule files contain no rule-specific reasoning, the monolith holds everything, and the read-this-skill-to-understand-this-rule workflow fails for everyone (developer user, future auditor, downstream agent). The contract says skills are KC's unit of per-rule granularity for a reason. Centralizing all check logic into one big file may look efficient but loses the per-rule auditability that's the whole point. If you find yourself writing `verify_engine.py` with 15+ check functions and stub SKILL.md/check.py per rule, stop — keep the methodology in each rule's SKILL.md (substantive) and either inline the check logic or use the canonical per-rule check.py + workflow_v1.py pattern.

**The contract**:
-- ✓ DO: SKILL.md describes WHAT to check + WHY + WHEN to flag it. Substantive — typically 50-300 lines, not
+- ✓ DO: SKILL.md describes WHAT to check + WHY + WHEN to flag it. Substantive — typically 50-300 lines, not a 20-line template.
- ✓ DO: check.py implements the check. EITHER substantive direct logic OR `from workflows.<rule_id>.workflow_v1 import verify` + delegate. Returns concrete verdicts.
- ✗ DON'T: stub SKILL.md with methodology in check.py comments (variant 1).
- ✗ DON'T: substantive SKILL.md with check.py that returns NOT_APPLICABLE without delegating to a workflow (variant 2).
- ✗ DON'T: stub check.py returning null verdict (variant 3, legacy).

+The engine may refuse phase advance if too many check.py files are stub-shaped. Better to author them substantively now.
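A minimal sketch of the delegating shape the contract describes. The import path follows the `from workflows.<rule_id>.workflow_v1 import verify` pattern named above (with the dash in the rule ID mapped to an underscore for a valid module name); the verdict fields mirror the contract, everything else is a scaffold assumption:

```python
#!/usr/bin/env python3
"""check.py for rule R01-01: thin entry point that delegates to the workflow."""
import json
import sys

# Substantive logic lives in the per-rule workflow module (R01-01 -> R01_01).
from workflows.R01_01.workflow_v1 import verify


def check(document_text: str) -> dict:
    # Returns a concrete verdict, never a stub placeholder.
    result = verify(document_text)
    return {"verdict": result["verdict"],
            "comment": result.get("comment", ""),
            "evidence": result.get("evidence", [])}


if __name__ == "__main__":
    text = open(sys.argv[1], encoding="utf-8").read()
    print(json.dumps(check(text), ensure_ascii=False, indent=2))
```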

## Writing SKILL.md

@@ -139,7 +141,7 @@ def check(document_text):

Recognized prefixes (Chinese + English variants): 预期命中点, 预期结果, 预期判定, 预期验证, 标注, 审核标注, Expected, expected, EXPECTED, Annotation, annotation. Pass `extra_prefixes=("..."、"...")` if your project uses different labels.

+A recurring failure mode worth flagging: a non-trivial fraction of rules have standalone check.py false-positive PASS on violation samples because the regex matches the `预期命中点: ...` annotation footer instead of the document body. KC ships the helper as a template file so this trap is one import away from being avoided.
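An illustrative guard against that trap. This is not the shipped helper, only a sketch of the idea: drop annotation-footer lines before the body regexes run (the prefix list mirrors the recognized prefixes above):

```python
# Labels that mark ground-truth annotations appended to test samples.
ANNOTATION_PREFIXES = ("预期命中点", "预期结果", "预期判定", "预期验证",
                       "标注", "审核标注", "Expected", "Annotation")

def strip_annotations(document_text: str, extra_prefixes: tuple = ()) -> str:
    """Remove annotation footer lines so check regexes only see the document body."""
    prefixes = tuple(p.lower() for p in ANNOTATION_PREFIXES + tuple(extra_prefixes))
    kept = [line for line in document_text.splitlines()
            if not line.strip().lower().startswith(prefixes)]
    return "\n".join(kept)
```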

## Writing References

@@ -157,6 +159,48 @@ Keep references factual and sourced. They are evidence, not instructions.
- **samples.json**: Annotated examples. Each entry: the input (extracted text or entity), the expected result (pass/fail/missing), and the expected comment. Build this incrementally as you test.
- **corner_cases.json**: Edge cases that the standard logic does not handle. Each entry: description, detection pattern, resolution, and confidence threshold. See the `corner-case-management` skill for the methodology.

+## Authoring methodology (from skill-creator core)
+
+This section folds in the universal authoring patterns from Anthropic's upstream `skill-creator`. Apply them on top of the KC-specific layout above when drafting any new rule skill.
+
+### Capture intent before drafting
+
+Before writing any skill, get clear on four questions:
+
+1. **What should this skill enable Claude (or check.py) to do?** — A single concrete capability, not a category.
+2. **When should it trigger?** — What user phrases / document contexts should match its description.
+3. **What's the expected output format?** — verdict + comment + evidence shape; or for non-check skills, the deliverable shape.
+4. **Do we need test samples?** — If the rule has objective pass/fail criteria (almost all KC rules do), yes. Build `assets/samples.json` incrementally as edge cases appear.
+
+If the conversation already contains worked examples (a user pointed at a passage and said "this is non-compliant"), extract answers from history first — don't ask the user to repeat themselves.
+
+### Frontmatter and progressive disclosure
+
+Skills load in three tiers; budget each tier for what it has to carry:
+
+1. **Metadata (name + description)** — always in the agent's context. ~100 words. This is the *primary triggering mechanism* — make it specific and slightly "pushy" (Claude tends to under-trigger skills). Include trigger keywords, rule ID, the regulation it derives from, and the document location it expects to find evidence in.
+2. **SKILL.md body** — loaded when the skill triggers. Target 100–300 lines for typical rules, hard ceiling 500. Explain the WHY behind the rule, not just the mechanics.
+3. **Bundled resources** (scripts/, references/, assets/) — loaded only when the body explicitly points to them. Big regulation excerpts and sample corpora belong here, not inline.
+
+If SKILL.md is approaching 500 lines, that's the cue to push detail down into `references/` and leave a pointer like "See references/edge-cases.md for the full enumeration of corner cases."
+
+### Writing style
+
+- **Imperative over passive.** "Extract the ratio" not "the ratio should be extracted."
+- **Explain why, not just what.** Today's LLMs have good theory of mind — a one-line "the regulation flags this to protect retail investors" makes a downstream agent generalize correctly to a case you didn't enumerate.
+- **Be wary of all-caps MUSTs and NEVERs.** If you find yourself reaching for them, that's usually a sign the underlying reasoning hasn't been made explicit. Reframe and explain.
+- **Be specific about location.** "Look in Chapter 2, Section 'Key Regulatory Metrics' or the summary table on page 1" beats "look in the financial disclosures somewhere."
+
+### Test samples before scripts
+
+After drafting SKILL.md, write 2–3 realistic sample inputs into `assets/samples.json` with their expected verdicts BEFORE you finalise `check.py`. The samples ground the script: every regex or keyword you add should be there to make a specific sample produce its expected verdict. Writing scripts in the abstract — without samples — almost always produces over-fitted code that fails on the first real document.
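A sketch of seeding `assets/samples.json`. The entry fields are a guess at the shape described under "Writing References" above (input, expected result, expected comment), and the sample texts are invented:

```python
import json

samples = [
    {"input": "管理费率由1.5%调整为2.0%，将于公告后15个工作日内披露。",
     "expected": "pass",
     "comment": "Fee change disclosed within the 15-business-day window."},
    {"input": "管理费率调整，将适时另行公告。",
     "expected": "fail",
     "comment": "No concrete disclosure deadline stated."},
]

with open("assets/samples.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```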
+
+### Iterate, don't perfect
+
+A rule skill rarely lands correctly on the first draft. Plan for at least one revision after testing surfaces problems. Don't pile on defensive MUSTs to handle every edge case — generalize the methodology. If three samples each needed a different one-off fix, that's a signal the underlying rule statement is too narrow.
+
+If your skill needs more sophisticated methodology than this section covers — formal eval loops with quantitative benchmarks, blind A/B comparison between skill versions, or description-optimization runs — consult `skill-creator`.
+
## Iteration

Skills evolve through testing. After each test iteration:
@@ -1,10 +1,22 @@
---
name: skill-creator
tier: meta
-description:
+description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
---

-# Skill
+# Skill creator (KC fallback)
+
+This is the upstream Anthropic `skill-creator` methodology, vendored into
+KC as a deep-reference for sophisticated skill authoring. KC's primary
+skill-authoring path is the `skill-authoring` skill (which inherits the
+core ideas from this skill). Only consult `skill-creator` directly when
+`skill-authoring`'s guidance feels insufficient for the complexity of the
+skill you need to write — e.g., when you want to run formal eval loops,
+benchmark with variance analysis, or optimise the trigger description.
+For ordering and grouping decisions about which skills to author first,
+see `work-decomposition`.
+
+---

A skill for creating new skills and iteratively improving them.

@@ -392,7 +404,7 @@ Use the model ID from your system prompt (the one powering the current session)

While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.

-This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude
+This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.

### How skill triggering works

@@ -436,6 +448,11 @@ In Claude.ai, the core workflow is the same (draft → test → review → impro

**Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.

+**Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. In this case:
+- **Preserve the original name.** Note the skill's directory name and `name` frontmatter field -- use them unchanged. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`).
+- **Copy to a writeable location before editing.** The installed skill path may be read-only. Copy to `/tmp/skill-name/`, edit there, and package from the copy.
+- **If packaging manually, stage in `/tmp/` first**, then copy to the output directory -- direct writes may fail due to permissions.
+
---

## Cowork-Specific Instructions
@@ -448,6 +465,7 @@ If you're in Cowork, the main things to know are:
- Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download `feedback.json` as a file. You can then read it from there (you may have to request access first).
- Packaging works — `package_skill.py` just needs Python and a filesystem.
- Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
+- **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. Follow the update guidance in the claude.ai section above.

---

@@ -478,3 +496,7 @@ Repeating one more time the core loop here for emphasis:
Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" in your TodoList to make sure it happens.

Good luck!
+
+---
+
+*Methodology adapted from [anthropics/skills](https://github.com/anthropics/skills) (Apache 2.0).*
@@ -70,7 +70,7 @@ If you do escalate to LLM:
- **tier2-3**: bulk extraction with simple semantic checks
- **tier4** (cheapest): high-volume keyword-spotting that regex can't handle. Note: tier4 models on SiliconFlow are Qwen3.5 thinking-mode — `content` can return empty if `reasoning_content` consumes max_tokens. Test with realistic prompts before relying. If you see empty responses, either bump max_tokens to ≥8192, shorten your prompt, or fall back to tier1-2.

+A recurring failure pattern worth flagging: agents default to all-regex distillation and only add LLM escalation when the human user explicitly asks for a "V2 with worker LLM". If your rule catalog has any rules where the verification is genuinely semantic, you should reach for `worker_llm_call` yourself — don't wait to be asked.

## Workflow Structure

@@ -148,6 +148,17 @@ All numbers here (10 documents, 5 percentage points, etc.) are recommended start

This follows the same tier-transition framework as parser escalation in `document-parsing`: a quality/accuracy score drives the decision to stay, escalate, or skip.

+### Picking the model inside a tier — quick reference
+
+The tier framework above answers "which tier is right for this step?". Inside a tier slot, "which specific model?" still matters. A few heuristics that hold today (refresh from `auto-model-selection` — specifics change in months, not years):
+
+- **Tier 1 / Tier 2 worker workhorse**: the current-generation flagship MoE LLMs (200-400B total / ~20B activated experts) are a reasonable starting baseline. Qwen's family flagship and DeepSeek's current premium model are both in this shape; either works.
+- **Tier 3 / Tier 4 small model**: prefer Qwen family for sub-30B options — many cheap, reliable choices. Skip the `coder` / `code` named variants at small sizes (unreliable for general worker tasks). Prefer no-thinking-mode variants when available; these tasks don't benefit from reflection.
+- **Provider stacking**: routing conductor and worker through different providers can isolate per-model throttle / rate-limit exposure (e.g., DeepSeek for workers, SiliconFlow for conductor).
+- **VLM / OCR**: characters / handwriting / seals → dedicated OCR model (Paddle-OCR, GLM-OCR, DeepSeek-OCR or their successors). Complex graphs / tables → larger general VLM.
+
+For up-to-date facts (exact model names, context windows, pricing), consult `auto-model-selection` and use Context7. The heuristics above go stale fast — the *shape* (MoE flagship for workhorse, sub-30B non-thinking for cheap bulk, OCR-specific for chars) is what stays.
+
## Testing Against Ground Truth

The coding agent's skill-based results are the ground truth. For each document in Samples/:
@@ -190,7 +201,7 @@ See `references/worker-llm-catalog.md` for current model capabilities and contex

## Two access paths: `worker_llm_call` tool (preferred) vs direct HTTP

-KC ships a `worker_llm_call` tool. Use it whenever possible — the engine sees every call, can track cost + token spend, applies rate limiting, and surfaces in audit.
+KC ships a `worker_llm_call` tool. Use it whenever possible — the engine sees every call, can track cost + token spend, applies rate limiting, and surfaces in audit. A batch mode is supported:

```
worker_llm_call({
@@ -203,7 +214,7 @@ worker_llm_call({

Returns a `{n_total, n_succeeded, n_failed, total_tokens_in, total_tokens_out, results: [...]}` summary. Partial failures don't fail the whole batch.

-### The canonical `workflows/common/llm_client.py` (
+### The canonical `workflows/common/llm_client.py` (shipped from template)

For a workflow that runs **standalone** (no KC session — e.g., a customer deploys the release bundle and runs `python run.py doc.pdf`), the workflow has no access to `worker_llm_call`. The canonical HTTP client shim ships as a template file and is auto-populated into every workspace's `workflows/common/llm_client.py` at engine init. **Do not write your own.** Use the file that's already there:

@@ -226,7 +237,15 @@ What the shim does:
- Writes a line to `output/llm_ledger.jsonl` per call so KC audits can reconstruct cost even when worker_llm_call wasn't used
- Raises an explicit error if `LLM_BASE_URL` is missing (no silent fallback to a hardcoded vendor)

-**Don't write your own llm_client.py from scratch.**
+**Don't write your own llm_client.py from scratch.** A recurring failure mode worth flagging: agents repeatedly roll their own shim — buggy (stale model IDs, hardcoded vendor URL, no ledger) and invisible to the engine. Use the canonical shim; if it's missing for some reason, copy it from `template/workflows/common/llm_client.py` in the kc-beta install (the engine also auto-populates at init — check `workflows_common_populated` event in events.jsonl).
+
+### Worse anti-pattern: writing a parallel hand-rolled client AT THE SAME TIME as the canonical one
+
+A persistent failure pattern across runs: the canonical `workflows/common/llm_client.py` is present (engine auto-populated it at init), AND the agent ALSO writes its own `workflows/llm_client.py` or `verify_engine_v2.py` with a `requests.post(...)` HTTP call. The agent then uses the hand-rolled one for all the real LLM work. Both files sit in the workspace. Engine cost-tracking sees nothing.
+
+When this happens, three things go wrong: (1) Provider routing breaks. The hand-rolled client typically reads `LLM_BASE_URL` from workspace `.env` — which is the **conductor**'s endpoint. KC's worker routing (via `worker_llm_call` and the engine's worker_* config) gets bypassed entirely. If the operator configured workers to a separate provider (say, DeepSeek workers + SiliconFlow conductor), the hand-rolled client wrongly hits the conductor's provider with the worker's model names — produces 400s or worse, silent wrong-model results. (2) Cost / audit visibility is lost. Engine doesn't see the calls; neither does `output/llm_ledger.jsonl` (the canonical client's ledger isn't written by the hand-rolled one). The session looks like it did zero LLM work, but the actual LLM bill exists. (3) Rate-limit / retry / timeout behavior diverges. The canonical client + worker_llm_call inherit engine-level resilience patterns (AbortSignal.timeout, withRetry on 429/5xx, etc.). A hand-rolled `requests.post` has none of that — it stalls, throws ad-hoc errors, or silently corrupts the run.
+
+**Rule of thumb**: if the verification rule needs an LLM judgment, your two valid options are `worker_llm_call` (when running inside a KC session) or `from workflows.common.llm_client import call` (when running standalone from a release bundle). If you find yourself typing `import requests` or `urllib.request.urlopen` for an LLM call, stop. That code path will be flagged in audit as a recurring adoption miss and rewritten — save the round trip and use the right tool the first time.
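A sketch of the standalone path. The import is the one named in the rule of thumb above; the `call` signature shown here is an assumption, so check the shim in `workflows/common/llm_client.py` for the real parameters:

```python
# Standalone release-bundle path: no KC session, so no worker_llm_call.
from workflows.common.llm_client import call  # canonical shim; do not hand-roll requests.post


def semantic_verdict(document_text: str) -> str:
    prompt = (
        "Does the following marketing copy imply a principal guarantee? "
        "Answer PASS or FAIL with one sentence of evidence.\n\n" + document_text
    )
    # Assumed signature: call(prompt, model_tier=...) returning the completion text.
    return call(prompt, model_tier="tier2")
```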

## sandbox_exec timeout for known-slow commands

@@ -126,7 +126,7 @@ Five failure modes recur across projects. Learn to recognize them early.

## Integration

-Task decomposition sits between rule extraction and skill authoring in the KC
+Task decomposition sits between rule extraction and skill authoring in the KC lifecycle. It is the bridge that translates abstract rules into concrete implementation plans.

**Input**: A rule catalog from `rule-extraction`. Each rule is an atomic, testable verification requirement. If a rule is not yet atomic, send it back to rule extraction for further decomposition before attempting task decomposition.
