kc-beta 0.8.1 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (63)
  1. package/package.json +1 -1
  2. package/src/agent/context.js +17 -1
  3. package/src/agent/engine.js +85 -8
  4. package/src/agent/llm-client.js +24 -1
  5. package/src/agent/pipelines/_milestone-derive.js +78 -7
  6. package/src/agent/pipelines/skill-authoring.js +19 -2
  7. package/src/agent/tools/release.js +94 -1
  8. package/src/cli/index.js +28 -7
  9. package/template/.env.template +1 -1
  10. package/template/AGENT.md +2 -2
  11. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  12. package/template/skills/en/bootstrap-workspace/SKILL.md +13 -0
  13. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  14. package/template/skills/en/confidence-system/SKILL.md +30 -8
  15. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  16. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  17. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  18. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  19. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  20. package/template/skills/en/document-chunking/SKILL.md +99 -15
  21. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  22. package/template/skills/en/quality-control/SKILL.md +14 -0
  23. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  24. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  25. package/template/skills/en/skill-authoring/SKILL.md +52 -8
  26. package/template/skills/en/skill-creator/SKILL.md +25 -3
  27. package/template/skills/en/skill-to-workflow/SKILL.md +23 -4
  28. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  29. package/template/skills/en/tree-processing/SKILL.md +1 -1
  30. package/template/skills/en/version-control/SKILL.md +15 -0
  31. package/template/skills/en/work-decomposition/SKILL.md +21 -35
  32. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  33. package/template/skills/zh/bootstrap-workspace/SKILL.md +13 -0
  34. package/template/skills/zh/compliance-judgment/SKILL.md +14 -0
  35. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  36. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  37. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  38. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  39. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  40. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  41. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  42. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  43. package/template/skills/zh/document-chunking/SKILL.md +96 -20
  44. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  45. package/template/skills/zh/entity-extraction/SKILL.md +14 -4
  46. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  47. package/template/skills/zh/quality-control/SKILL.md +14 -0
  48. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  49. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  50. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  51. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  52. package/template/skills/zh/skill-authoring/SKILL.md +108 -69
  53. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  54. package/template/skills/zh/skill-creator/SKILL.md +71 -61
  55. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  56. package/template/skills/zh/skill-to-workflow/SKILL.md +24 -5
  57. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  58. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  59. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  60. package/template/skills/zh/tree-processing/SKILL.md +1 -1
  61. package/template/skills/zh/version-control/SKILL.md +15 -0
  62. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  63. package/template/skills/zh/work-decomposition/SKILL.md +21 -33
@@ -16,6 +16,12 @@ Data/entity extraction (`entity-extraction`) is the **repeating task** that runs
 
 Don't conflate the two. Rule extraction happens once; data extraction happens on every document.
 
+ ## Source-first sequencing
+
+ Extract rules from the source text FIRST. Only after you have a complete first-pass catalog from sources alone should you open sample documents. The temptation is to peek at samples early to "see what kinds of rules matter" — this biases you toward rules the samples happen to exercise and silently drops rules the samples don't cover.
+
+ A domain professional reads the source material, builds an understanding, then validates on samples — not the reverse. KC's differentiator over general-purpose agents is systematic accuracy across long context; that advantage compounds when you ground in the SOURCE, not the EXAMPLES.
+
 ## Rule Structure: Location → Extraction → Judgment
 
 Every verification rule decomposes into three parts:
@@ -62,22 +68,17 @@ When rules change (additions, modifications, deprecations), version the entire r
 
 ## Granularity Calibration (read before extracting)
 
- A well-extracted rule catalog has **10-20 rules per typical regulation PDF**
- (a 30-80 page disclosure regulation). Over-extraction into 60-100 rules per
- regulation signals you're treating every clause as its own rule; downstream
- consumers (skill-authoring, workflow-run) can't distinguish meaningful
- checks from boilerplate.
-
- If your first pass produces more than ~25 rules for a single regulation:
- - **Merge rules that share evidence and fail together** (e.g., "must
-   disclose X" and "must disclose Y" where both come from the same
-   required-fields table → one rule: "must disclose the required-fields
-   list including X, Y").
- - **Drop procedural language** that isn't checkable against a report
-   (definitions, scope statements, references to other regs that just
-   transitively apply).
- - **Keep only checkable obligations, prohibitions, and thresholds** —
-   things where you can read a sample report and say pass or fail.
+ Rule catalogs come from diverse source materials — formal regulations, internal handbooks, case law, legal opinions, expert rule tables, regulator Q&A. There is no universal "right number of rules per page". Calibrate by logic, not by count:
+
+ - **Atomicity is the real test.** A rule that can produce two independent pass/fail outcomes is two rules. A rule whose verdict requires verifying three different paragraphs of the source is probably three rules.
+ - **Boilerplate is not a rule.** Definitions, scope statements, transitive references to other regulations, and procedural language that can't be checked against the target document do not become rules.
+ - **Keep only checkable obligations, prohibitions, and thresholds** — things where you can read a target document and say pass / fail / not-applicable.
+
+ If your first pass feels too coarse (one rule per chapter, ignoring multiple distinct obligations within) — go finer. If it feels too fine (every clause in a definitions section is its own rule) — merge or drop. Then:
+
+ - **Merge rules that share evidence and fail together** (e.g., "must disclose X" and "must disclose Y" where both come from the same required-fields table → one rule: "must disclose the required-fields list including X, Y").
+ - **Drop procedural language** that isn't checkable against a target document.
+ - **Convert each surviving rule into a falsifiability statement** — if you can't state precisely what would make it fail, you don't have a rule yet.
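The calibration checks above can be sketched as a quick lint pass over a candidate rule. A minimal sketch, assuming illustrative field names (`falsifiability_statement`, `obligations`, `notes`) rather than the actual rule_catalog schema:

```python
# Hypothetical rule-entry shape; field names are illustrative, not the
# actual rule_catalog schema.
def falsifiability_gaps(rule):
    """Return reasons a rule is not yet an atomic, checkable rule."""
    gaps = []
    if not rule.get("falsifiability_statement"):
        gaps.append("no statement of what would make the rule FAIL")
    # Vague wording usually means a threshold got summarized away.
    vague = ("reasonable", "timely", "appropriate", "及时")
    if any(w in rule.get("description", "") for w in vague) and not rule.get("notes"):
        gaps.append("vague term without an explicit ambiguity note")
    # A rule carrying several independent obligations should be split.
    if len(rule.get("obligations", [])) > 1:
        gaps.append("compound rule: one entry per pass/fail outcome")
    return gaps

rule = {"id": "R014",
        "description": "Disclose within a reasonable period",
        "obligations": ["disclose"]}
print(falsifiability_gaps(rule))
```

A rule that returns an empty list here is at least structurally ready for downstream skill-authoring; the semantic judgment still rests with you.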
 
 ### Sample "good" rule
 
@@ -94,104 +95,58 @@ If your first pass produces more than ~25 rules for a single regulation:
 }
 ```
 
- Note: one pass/fail outcome, a single `source_ref` to a specific clause,
- clear applicability scope. Skill-authoring can write `check_r014.py` from
- this alone.
+ Note: one pass/fail outcome, a single `source_ref` to a specific clause, clear applicability scope. Skill-authoring can write `check_r014.py` from this alone.
 
- ### Cross-regulation dedup (when working across multiple PDFs)
+ ### Cross-source dedup (when working across multiple documents)
 
- If the developer user provides N regulations, rules from later regs often
- duplicate cross-cutting requirements already captured by earlier ones
- (e.g., a 2018 generic disclosure rule vs. a 2025 specific version).
- Before emitting a rule from reg N:
+ If the developer user provides N source documents, rules from later sources often duplicate cross-cutting requirements already captured by earlier ones (e.g., a generic disclosure rule from an older regulation vs. a newer specific version of the same obligation). Before emitting a rule from source N:
 
- 1. **Check the existing catalog.** Use `rule_catalog` (operation: list)
-    to see what's already there. Skip if a rule with equivalent scope +
-    intent exists.
+ 1. **Check the existing catalog.** Use `rule_catalog` (operation: list) to see what's already there. Skip if a rule with equivalent scope + intent exists.
 2. **Prefer the newer / more specific source_ref** when rules overlap.
- 3. **If you merged rules**, record the consolidated sources in
-    `source_ref`: e.g., `"New Reg §15.2 + Old Reg §24"`.
+ 3. **If you merged rules**, record the consolidated sources in `source_ref`: e.g., `"New Reg §15.2 + Old Reg §24"`.
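The three dedup steps can be sketched as one pass over the existing catalog. The entry shapes, `applicability`/`keywords` fields, and the intent heuristic are all illustrative assumptions, not the actual rule_catalog schema:

```python
# Illustrative sketch of the dedup contract; catalog entries and field
# names are assumed, not the actual rule_catalog schema.
def find_duplicate(candidate, catalog):
    """Return an existing rule with equivalent scope + intent, if any."""
    for existing in catalog:
        same_scope = existing["applicability"] == candidate["applicability"]
        # Crude intent match: the two rules share obligation keywords.
        same_intent = set(existing["keywords"]) & set(candidate["keywords"])
        if same_scope and same_intent:
            return existing
    return None

catalog = [{"id": "R001", "applicability": "wealth-products",
            "keywords": ["disclose", "annualized-rate"],
            "source_ref": "Old Reg §24"}]
candidate = {"applicability": "wealth-products",
             "keywords": ["disclose", "annualized-rate"],
             "source_ref": "New Reg §15.2"}
dup = find_duplicate(candidate, catalog)
if dup:
    # Prefer the newer, more specific source_ref; record both when merging.
    dup["source_ref"] = f"{candidate['source_ref']} + {dup['source_ref']}"
print(catalog[0]["source_ref"])  # → New Reg §15.2 + Old Reg §24
```

In practice the "equivalent intent" judgment is semantic, not a keyword intersection; the sketch only fixes the shape of the check-before-emit loop.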
 
 ### Delegation to sub-agents
 
- If you dispatch extraction to sub-agents (one per regulation), the
- sub-agent inherits ONLY its `task_description` — it cannot see your
- conversation or existing catalog. Therefore, when composing the brief:
-
- - **Specify the target count band** explicitly: "Extract 10-20 atomic
-   rules from this regulation."
- - **Include a sample rule** in the brief body (paste the JSON above
-   verbatim) so the sub-agent's calibration matches yours.
- - **Name every regulation the sub-agent should process.** If AGENT.md
-   lists 10 core regulations, the brief must list all 10 by name, not
-   "the core regs" as a pronoun — LLMs composing long structured briefs
-   frequently drop items (observed in session 6304673afaa0 where reg 02
-   was silently omitted).
- - **State the dedup contract**: "Rules already in the parent's catalog
-   (R001–Rnnn) should NOT be re-extracted. If a requirement is already
-   covered, skip it." Then pass the current catalog's ID ranges.
- - **Prefer `rule_catalog` create operations over sandbox_exec writes to
-   catalog.json.** rule_catalog uses workspace file locking;
-   sandbox_exec bypasses it and races with other writers.
-
- ## How to read regulation files (default: read whole)
-
- Regulations are the audit's authoritative basis. Every `source_ref`
- in your extracted rules must be verifiable against the source text.
- For typical regulation documents (a single file under ~50 KB / under
- ~100 pages), **read each regulation file whole using `workspace_file`
- (operation=read) in a single call**:
+ If you dispatch extraction to sub-agents (one per source document), the sub-agent inherits ONLY its `task_description` — it cannot see your conversation or existing catalog. Therefore, when composing the brief:
+
+ - **Anchor calibration with a concrete sample rule.** Paste the JSON above verbatim into the brief body so the sub-agent's atomicity calibration matches yours.
+ - **Name every source document the sub-agent should process.** If AGENT.md lists 10 core source documents, the brief must list all 10 by name, not "the core regs" as a pronoun — LLMs composing long structured briefs frequently drop items silently.
+ - **State the dedup contract**: "Rules already in the parent's catalog (R001–Rnnn) should NOT be re-extracted. If a requirement is already covered, skip it." Then pass the current catalog's ID ranges.
+ - **Prefer `rule_catalog` create operations over sandbox_exec writes to catalog.json.** rule_catalog uses workspace file locking; sandbox_exec bypasses it and races with other writers.
+
+ ## How to read source files (default: read whole)
+
+ Source documents are the catalog's authoritative basis. Every `source_ref` in your extracted rules must be verifiable against the source text. For typical source documents (a single file under ~50 KB / under ~100 pages), **read each source file whole using `workspace_file` (operation=read) in a single call**:
 
 ```js
- workspace_file({ operation: "read", scope: "project", path: "Rules/01_some_regulation.md" })
+ workspace_file({ operation: "read", scope: "project", path: "Rules/01_some_source.md" })
 ```
 
- `workspace_file.read` is capped at 50,000 chars per call, which
- covers virtually every individual regulation document. This is the
- default. **Read every regulation file whole before you start
- extracting rules from any of them.**
+ `workspace_file.read` is capped at 50,000 chars per call, which covers virtually every individual source document. This is the default. **Read every source file whole before you start extracting rules from any of them.**
 
 ### Tool choice — `workspace_file` vs `sandbox_exec`
 
 | Tool | Per-call cap | Use for |
 |---|---:|---|
- | `workspace_file` (read) | 50,000 chars | **full reads of regulation / rule documents** |
+ | `workspace_file` (read) | 50,000 chars | **full reads of source / rule documents** |
 | `sandbox_exec` (cat/head/etc) | 10,000 chars | shell commands, **not** full file reads |
 
- `sandbox_exec` is designed for shell commands; its 10K cap is too
- small for most regulations. `cat rules/01_*.md` returns only the
- first ~10 KB followed by `\n[truncated]`. Re-issuing with `head -N` /
- `tail -M` to scroll the window loses positional precision and burns
- turns. **When you see truncation, don't fight the cap — switch
- tools.**
+ `sandbox_exec` is designed for shell commands; its 10K cap is too small for most regulations. `cat rules/01_*.md` returns only the first ~10 KB followed by `\n[truncated]`. Re-issuing with `head -N` / `tail -M` to scroll the window loses positional precision and burns turns. **When you see truncation, don't fight the cap — switch tools.**
 
- ### Asymmetry — regs read whole, samples sampled
+ ### Asymmetry — sources read whole, samples sampled
 
- Regulations are limited (typically 1-10 files), authoritative, and
- read once. Read every regulation whole.
+ Source documents are limited (typically 1-10 files), authoritative, and read once. Read every source file whole.
 
- Sample documents may number 30 to 1000+, are heterogeneous, and get
- read many times during testing. **Don't try to read every sample
- whole.** Use rule-applicability filters or sampled subsets to focus
- attention.
+ Sample documents may number 30 to 1000+, are heterogeneous, and get read many times during testing. **Don't try to read every sample whole.** Use rule-applicability filters or sampled subsets to focus attention.
 
- ### Escape valve — when a single reg exceeds ~200K chars
+ ### Escape valve — when a single source exceeds ~200K chars
 
- Rare in practice. The largest regulation in `test_data_4` is 42 KB;
- typical Chinese banking regs (资管新规, 信披办法, etc.) all fit
- under 50 KB. But if you do encounter a single regulation so large
- that reading it whole would crowd the context window — heuristic:
- the file exceeds ~200,000 chars or ~25% of your context budget —
- use your own judgment:
+ Rare in practice: most regulation, handbook, or rule-table documents fit comfortably under 50 KB. But if you do encounter a single source document so large that reading it whole would crowd the context window — heuristic: the file exceeds ~200,000 chars or ~25% of your context budget — use your own judgment:
 
- - Read by chapter (e.g., `第X章` / `Chapter X`) using `document_parse`
-   or paginated `workspace_file` reads
- - Or build an in-workspace index file pointing to chapter offsets and
-   read on-demand per rule being extracted
+ - Read by chapter (e.g., `第X章` / `Chapter X`) using `document_parse` or paginated `workspace_file` reads
+ - Or build an in-workspace index file pointing to chapter offsets and read on-demand per rule being extracted
 
- The 50 KB cap is high enough that this almost never triggers. **The
- default is read whole; deviate only when the file genuinely doesn't
- fit.**
+ The 50 KB cap is high enough that this almost never triggers. **The default is read whole; deviate only when the file genuinely doesn't fit.**
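The chapter-offset index mentioned in the escape valve can be built with a plain regex scan. A minimal sketch, assuming the `第X章` / `Chapter X` heading conventions named above actually hold in the corpus:

```python
import re

# Offset index for an oversized source file. The heading patterns mirror
# the skill's examples (第X章 / Chapter X) and are assumptions about the
# corpus, not a guaranteed format.
CHAPTER_RE = re.compile(
    r"^(第[一二三四五六七八九十百]+章.*|Chapter \d+.*)$", re.MULTILINE
)

def chapter_index(text):
    """Map each chapter heading to its character offset, for on-demand reads."""
    return [(m.group(1).strip(), m.start()) for m in CHAPTER_RE.finditer(text)]

doc = "第一章 总则\n...\n第二章 信息披露\n...\nChapter 3 Governance\n..."
for title, offset in chapter_index(doc):
    print(offset, title)
```

Persisting this index in the workspace lets each rule-extraction step read only the chapter its `source_ref` points into, instead of re-reading the whole file.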
 
  ## Extraction Strategies
 
@@ -202,11 +157,14 @@ When the developer user provides rules in xlsx, csv, or a structured document wh
 - Map each row to a rule, preserving the developer user's identifiers.
 - Ask clarifying questions only if entries are ambiguous.
 
- ### Strategy 2: Hierarchical Extraction from Regulation Text
+ ### Strategy 2: Hierarchical Extraction from Source Text
 
- For raw regulation documents (PDF, DOCX, legal text):
+ For raw source documents (PDF, DOCX, legal text, handbooks, case collections):
 
 1. **Survey the document structure.** Read the table of contents or scan headers. Understand the hierarchy: parts, chapters, sections, articles, clauses.
+
+    Before extracting any rule, traverse the table of contents and section headers end-to-end. Sketch the rule-bearing hierarchy: which chapters impose obligations, which are definitions / context. A common failure mode: a long source with many articles yields disproportionately few rules — almost always meaning you stopped surveying after the high-density chapters. Decide your rule-bearing chapter span explicitly, then justify deviations relative to that span rather than to a single global count target.
+
 2. **Identify rule-bearing sections.** Not every section contains a verification rule. Some are definitions, some are procedural, some are context. Focus on sections that impose obligations, prohibitions, thresholds, or requirements.
 3. **Peel the onion.** Start at the highest structural level and work downward:
    - Level 1: What major areas does the regulation cover? (e.g., capital adequacy, risk disclosure, governance)
@@ -216,7 +174,7 @@ For raw regulation documents (PDF, DOCX, legal text):
 4. **Handle cross-references.** Regulations love to say "as defined in Section X" or "subject to the conditions in Article Y." Resolve these by including the referenced content in the rule's description, not just the reference.
 5. **Handle compound rules.** "The report must include (a) risk factors, (b) financial projections, and (c) management discussion" — this is three rules, not one. Decompose unless the developer user specifically wants them grouped.
 
- For long documents (100+ pages), use the onion-peeler approach described in `references/chunking-strategies.md`. Do not try to read the entire document in one pass.
+ For long documents, use the onion-peeler approach; see the `document-chunking` skill for the full strategy and the wedge-driving fallback for sections without clear headers. Do not try to read the entire document in one pass.
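Step 5's compound-rule decomposition can be sketched for the simple `(a)/(b)/(c)` case. The splitting regex, entry shape, and `R021` ID are illustrative, tuned to the example sentence, not a general legal-text parser:

```python
import re

# Hypothetical decomposition of a compound obligation into atomic rules.
# The (a)/(b)/(c) parsing is illustrative; real clauses need judgment.
def decompose_compound(rule_id, text):
    """Split '(a) ... (b) ... (c) ...' style obligations into one rule each."""
    parts = re.split(r"\([a-z]\)\s*", text)
    stem = parts[0].strip()
    # Drop the list glue (trailing commas, ", and") from each item.
    items = [re.sub(r",?\s+and\s*$", "", p).strip(" ,.") for p in parts[1:]]
    return [{"id": f"{rule_id}.{i + 1}", "description": f"{stem} {item}"}
            for i, item in enumerate(items)]

rules = decompose_compound(
    "R021",
    "The report must include (a) risk factors, (b) financial projections, "
    "and (c) management discussion",
)
for r in rules:
    print(r["id"], "-", r["description"])
```

Each resulting entry carries one pass/fail outcome, which is what downstream skill-authoring needs.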
 
 ### Strategy 3: Expert Notes
 
@@ -285,6 +243,8 @@ Do not skip ambiguous rules. They are often the most important ones.
 
 ## Sanity-check applicability against the sample corpus
 
+ > This is a validation pass, not a discovery pass. Do not let 0-sample rules tempt you to delete them at this stage — first ask whether the source requires them; if yes, keep them as "future scope" rather than dropping them.
+
 After extracting your rule catalog and before authoring skills, do this 5-minute check: project each rule's applicability filter against the sample corpus.
 
 For every rule:
@@ -292,14 +252,52 @@ For every rule:
 2. For each rule, count how many samples it would apply to (per the rule's `applicability` field, scope filter, or whatever shape your catalog uses)
 3. Flag rules that apply to **0 samples** — they're either genuinely test-corpus-irrelevant (acceptable) or over-constrained (bug)
 
- E2E #7 GLM produced a 97-rule catalog where 36 rules (37%) had `PASS=0 FAIL=0 NOT_APPLICABLE=90` across all 90 documents; they never fired. Some were legit (rules for cash-management products with no cash-management samples in corpus), but 36 inactive of 97 was high enough to suggest scope-too-narrow drift.
+ A failure mode worth flagging: a catalog where a large fraction of rules (say 30-40%) return `PASS=0 FAIL=0 NOT_APPLICABLE=all` across the entire sample set. Some inactive rules are legitimate (the source requires checks for a product type the corpus doesn't happen to contain), but a high inactive ratio almost always signals scope-too-narrow drift — applicability filters that over-specify.
 
 If many rules are 0-sample, either:
 - **Reframe their applicability** — broaden product types, look for evidence in headers/footers not just body, relax the scope filter
 - **Document them as "future scope"** and remove from this iteration's catalog (still capture them in a `rules/future_scope.md` so they're not forgotten)
 - **Update the test corpus** to include matching samples (work with the developer user)
 
- Catching this in `rule_extraction` is much cheaper than authoring 36 skills that then test as inactive in `skill_testing`. The cheap projection here is worth the time it saves later.
+ Catching this in `rule_extraction` is much cheaper than authoring N skills that then test as inactive in `skill_testing`. The cheap projection here is worth the time it saves later.
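The cheap projection described above can be sketched directly. Rule and sample shapes here are illustrative dicts (a `product_type` field and a set-valued `applicability`), not the actual catalog or corpus schema:

```python
# Minimal sketch of the applicability projection; rules and samples are
# illustrative dicts, not the actual catalog / corpus schema.
def project_applicability(rules, samples):
    """Count matching samples per rule and flag rules that never fire."""
    counts = {}
    for rule in rules:
        counts[rule["id"]] = sum(
            1 for s in samples if s["product_type"] in rule["applicability"]
        )
    inactive = [rid for rid, n in counts.items() if n == 0]
    return counts, inactive

rules = [
    {"id": "R001", "applicability": {"wealth", "fund"}},
    {"id": "R002", "applicability": {"cash-management"}},
]
samples = [{"product_type": "wealth"}, {"product_type": "fund"}]
counts, inactive = project_applicability(rules, samples)
print(counts)    # {'R001': 2, 'R002': 0}
print(inactive)  # ['R002']
```

Every rule in `inactive` then gets the triage above: reframe, document as future scope, or grow the corpus.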
+
+ ## Logic-type taxonomy (coverage diagnostic)
+
+ After first-pass extraction, classify each rule by judgment type:
+
+ - **Threshold** — numeric comparison ("annualized rate ≥ 15.4%")
+ - **Decision-Tree** — multi-branch ("if product type ∈ {A, B} then ...")
+ - **Heuristic** — semantic judgment ("does marketing copy imply principal guarantee")
+ - **Process** — procedural compliance ("published within the required deadline")
+
+ If your catalog is 90% Threshold rules, you have likely missed the semantic / process obligations that don't reduce to a number. Re-survey for those. The four types are roughly comparable in frequency across most rule corpora; a heavy skew is a signal to look again at the chapters or sections you skimmed.
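The skew diagnostic can be sketched as a frequency check over a `logic_type` tag. The field name and the 50% dominance cutoff are assumptions for illustration, not KC-specified values:

```python
from collections import Counter

# Illustrative skew check over per-rule "logic_type" tags; the 50%
# cutoff is an assumed threshold, not a KC-specified number.
TYPES = ("Threshold", "Decision-Tree", "Heuristic", "Process")

def type_skew(rules, max_share=0.5):
    """Return logic types that are missing or over-represented."""
    counts = Counter(r["logic_type"] for r in rules)
    missing = [t for t in TYPES if counts[t] == 0]
    dominant = [t for t in TYPES if counts[t] / len(rules) > max_share]
    return missing, dominant

rules = [{"logic_type": "Threshold"}] * 9 + [{"logic_type": "Heuristic"}]
missing, dominant = type_skew(rules)
print(missing)   # ['Decision-Tree', 'Process']
print(dominant)  # ['Threshold']
```

A non-empty `missing` or `dominant` list is the cue to re-survey the chapters you skimmed, not proof anything is wrong.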
+
+ ## Preserve specifics (anti-summarize)
+
+ When writing a rule's `description` and `falsifiability_statement`, preserve every threshold, percentage, deadline, and named entity from the source. "Disclose within a reasonable period" is a vague rule and will fail downstream — the source almost certainly says "within 15 business days." If the source IS genuinely vague, flag the ambiguity explicitly (e.g., `notes: "source uses '及时'; no numeric deadline"`) rather than smoothing it. Downstream skill-authoring will need the specifics to write check.py logic.
+
+ ## Soft sample-access discipline
+
+ You have unlimited tool access to samples — KC does not cap you. The discipline is procedural: source-extraction phase first, then validation phase. Inside the source-extraction phase, samples are a last-resort reference for clarifying terminology, not a discovery surface. If you find yourself opening sample #3 to figure out what to extract next, you have inverted the methodology — close the sample, return to the source. Acceptable narrow exceptions:
+ - A jargon term in the source needs example resolution
+ - Sanity-checking that a rule's `description` field reads coherently when applied to a real document
+
+ ## Primary vs auxiliary sources — iteration order, NOT coverage breadth
+
+ When the developer user labels some source documents "primary" and others "auxiliary" (or "supplementary", or "secondary"), that distinction is about **iteration order**: do the primary regs deeply first, then come back to the auxiliary ones. It is **NOT** a license to skip the auxiliary regs entirely.
+
+ A recurring failure mode worth flagging: the agent reads "primary 01-02 are the main basis, the rest is auxiliary" and produces 13 rules from regs 01-02 + 2 rules from regs 03-04 + zero rules from regs 05-10. The auxiliary regulations (often 60-90 articles each in compliance domains) almost always contain core obligations the primary regs reference or assume. Extracting nothing from them produces a thin catalog that misses real compliance requirements.
+
+ The right interpretation: the primary regs get the first deep pass, the auxiliary regs get a structural-survey pass at minimum — identify their core obligations and extract those, even if not at the same density as the primary ones. Skipping an 80-article regulation entirely should require an explicit reason in `coverage_audit.md` (e.g., "regulation 05 covers fund operations outside our case scope; explicitly out-of-scope per user discussion"). Silent skipping is the failure mode.
+
+ ## Coverage trace (recommended deliverable)
+
+ After extraction, walk the source document paragraph-by-paragraph and tag each as either:
+
+ - `covered_by: [Rxxx, Ryyy]` — articles whose obligations became one or more rules
+ - `non_checkable: definition | context | cross_ref | scope` — articles excluded with explicit reason
+
+ Write this as `rules/coverage_trace.md` (or a section in `coverage_audit.md`). This is the source-side mirror of the existing sample-side applicability check, and catches the "long source → suspiciously few rules" failure mode directly. Engine derivation can read this trace to validate completeness later.
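The coverage-trace deliverable can be sketched as a small renderer. The article shapes, tag names, and markdown-table layout are illustrative, not a KC-mandated format:

```python
# Sketch of the coverage-trace deliverable; article IDs, tags, and the
# markdown layout are illustrative, not a KC-mandated format.
def write_coverage_trace(articles):
    """Render per-article coverage tags as a markdown table."""
    lines = ["# Coverage trace", "", "| article | status |", "|---|---|"]
    for art in articles:
        if art.get("covered_by"):
            status = "covered_by: [" + ", ".join(art["covered_by"]) + "]"
        else:
            status = "non_checkable: " + art["reason"]
        lines.append(f"| {art['ref']} | {status} |")
    return "\n".join(lines) + "\n"

articles = [
    {"ref": "§3", "covered_by": ["R014"]},
    {"ref": "§1", "reason": "definition"},
]
print(write_coverage_trace(articles))
```

In the workspace the rendered string would be written to `rules/coverage_trace.md` (e.g., via `workspace_file` with operation=write) so later passes can audit it.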
 
 ## When Rules Change
 
@@ -1,80 +1,9 @@
1
- # Chunking Strategies for Long Documents
1
+ # Chunking Strategies
2
2
 
3
- When regulation documents exceed what you can process in a single pass, use these proven strategies to decompose them into manageable chunks while preserving semantic coherence.
3
+ The chunking methodology (onion-peeler + wedge fallback + balance
4
+ heuristics) now lives in the `document-chunking` skill. Consult that
5
+ skill directly when designing chunking for rule extraction or any
6
+ downstream processing.
4
7
 
5
- ## The Onion Peeler (Primary Strategy)
6
-
7
- Hierarchical header-based decomposition. Named because you peel the document layer by layer, from the outermost structure inward.
8
-
9
- ### How It Works
10
-
11
- 1. **Parse the document's header hierarchy.** Identify all headers by level (H1, H2, H3, etc. — or their equivalents in the document's formatting: "Part I", "Chapter 1", "Section 1.1", "Article 1").
12
- 2. **Build a tree.** Each header becomes a node. Content between headers belongs to the nearest preceding header at that level.
13
- 3. **Check sizes.** Walk the tree. If a node's content (including all its children) fits within your processing limit, stop — this node is a chunk.
14
- 4. **Split only when necessary.** If a node exceeds the limit, descend to its children. Only split when a node is too large AND has sub-headers to split on.
15
- 5. **Leaf nodes that are still too large** get handled by the wedge-driving fallback (see below).
16
-
17
- ### Why This Works
18
-
19
- - Respects the document's own semantic structure. A "Chapter 3: Risk Disclosure" chunk contains exactly what the author intended that chapter to contain.
20
- - Minimizes information loss. You never cut in the middle of a thought.
21
- - Produces chunks of varying size — and that is fine. A short chapter is better as one chunk than split into artificial halves.
22
-
23
- ### Pattern Discovery Shortcut
24
-
25
- Before building a full parser, explore several sample documents for structural patterns:
26
- - Do all chapter titles start with "Chapter X" or "第X章"?
27
- - Are sections numbered consistently (1.1, 1.2, 1.3)?
28
- - Are there visual markers (bold text, specific fonts, horizontal rules)?
29
-
30
- If you find consistent patterns, a regex-based splitter is faster and more reliable than LLM-based structure detection. For example:
31
- - `^第[一二三四五六七八九十百]+章` for Chinese chapter headers
32
- - `^Chapter \d+` for English chapter headers
33
- - `^\d+\.\d+` for numbered sections
34
-
35
- Always validate the regex against multiple documents before committing to it.
36
-
37
- ## Wedge Driving (Fallback Strategy)
38
-
39
- For content without clear headers — dense legal text, continuous prose, or leaf nodes from the onion peeler that are still too large.
40
-
41
- ### How It Works
42
-
43
- The algorithm uses a **rolling context window** to process documents of arbitrary length without loading the full text at once.
44
-
45
- **Step 1: Window the content.** Load up to MAX_TOKENS (e.g., 100K tokens — configurable) of the remaining unprocessed text into a window. If the remaining text fits in a single chunk, stop — no further splitting needed.
46
-
47
- **Step 2: Ask an LLM for cut points.** Prompt the LLM to identify 1-3 natural break points within the window where topic or subject changes. For each cut point, the LLM returns:
48
- - `tokens_before`: ~K tokens (default K=50) immediately BEFORE the cut, copied verbatim from the text.
49
- - `tokens_after`: ~K tokens immediately AFTER the cut, copied verbatim.
50
- - `chunk_title`: a 5-10 word title describing the chunk that precedes the cut.
51
-
52
- Using token count (not word count) gives consistent granularity across languages — critical for Chinese text which has no whitespace-delimited words.
53
-
54
- **Step 3: Locate the cuts via fuzzy matching.** The LLM's quoted tokens will not be a perfect match to the source text (minor paraphrasing, whitespace differences, encoding artifacts). Use Levenshtein distance (edit distance) to find the best match:
55
- 1. Search the source text for the position that best matches `tokens_before`. Require at least 70% similarity (similarity = 1 - edit_distance / max_length).
56
- 2. The cut position is immediately after the matched `tokens_before` region.
57
- 3. Verify by checking that `tokens_after` appears near the cut position. If `tokens_after` cannot be matched, fall back to the position derived from `tokens_before` alone.
58
-
59
- **Step 4: Slide and repeat.** Create a chunk from the text before the first confirmed cut. Move the window forward: the new window starts from the last cut point. Repeat until all remaining text fits in a single chunk.
-
- ### Why This Works
-
- - The LLM identifies semantic boundaries, not arbitrary character counts.
- - The LLM never regenerates text — it only quotes positions. No hallucination risk.
- - K-token quoting with Levenshtein matching is language-agnostic. It works for Chinese, English, and mixed-language documents equally well.
- - The rolling window means documents of any length can be processed incrementally — the algorithm is not bounded by context window size.
- - Fuzzy matching handles the inevitable small differences between the LLM's quoted text and the actual source.
-
- ### When to Use
-
- - Only when the onion peeler cannot split further (no sub-headers available).
- - For documents with no structural markup at all.
- - Cost consideration: this requires LLM calls. Use the cheapest model that can identify topic boundaries (often TIER3 or TIER4 is sufficient).
-
- ## Practical Guidelines
-
- - **Chunk size depends on the downstream task.** For rule extraction by the coding agent, chunks can be large (100K+ tokens). For worker LLM processing, chunks must fit in 16K-32K context.
- - **Preserve context.** When splitting, include the parent header chain as context. A chunk from "Part II > Chapter 3 > Section 3.2" should include those headers so downstream processing knows where the content belongs.
- - **Cache the tree.** Once a document's structure is parsed, save the tree. Multiple rules may need content from the same document, and re-parsing is wasteful.
- - **Log your chunking decisions.** Which strategy was used, how many chunks were produced, their sizes. This helps debug downstream issues.
+ This stub remains for legacy references; new content goes to
+ `document-chunking`.
@@ -43,9 +43,9 @@ When grouping, name the file with the explicit range so downstream consumers (wo

### Anti-pattern: the unified runner

- If you find yourself writing a single `unified_qc.py` (or `batch_runner.py`, or `master_check.py`) that handles all 110 rules in one Python file, **stop**. That means your per-rule skills are wrong, not that the architecture is wrong. Fix the skills.
+ If you find yourself writing a single `unified_qc.py` (or `batch_runner.py`, or `master_check.py`) that handles all rules in one Python file, **stop**. That means your per-rule skills are wrong, not that the architecture is wrong. Fix the skills.

- E2E #4 demonstrated the cost: an agent wrote `unified_qc.py` to bypass 110 individual skills it didn't trust. Result was 1,150 errors out of 6,930 production checks (16.6%) and a phase counter stuck in `production_qc` while real work happened in skill_authoring. The unified runner felt productive locally and was a global mistake.
+ A failure pattern worth flagging: an agent writes a unified runner like `unified_qc.py` to bypass individual skills it doesn't trust. The result is cascading errors: a single rule's failure corrupts every other rule's verdict, easily producing 15%+ error rates across thousands of production checks. The unified runner feels productive locally and is a global mistake. It also stalls the phase model: with no individual `check.py` files landing on disk, the engine can't credit the work toward milestone completion.

If individual skills aren't running cleanly, the right response is to identify which ones break and fix them, not consolidate. The whole pipeline (extraction → skill_testing → distillation → production_qc) assumes one rule = one verifiable artifact.

@@ -53,20 +53,22 @@ If individual skills aren't running cleanly, the right response is to identify w

Each rule_skill folder MUST have BOTH a substantive `SKILL.md` AND a substantive `check.py` (or `check.py` that imports + calls a workflow that does the real work). One side being a stub breaks the contract.

- **Variant 1 (v0.7.5 贷款 audit § 9.1)**: stub `SKILL.md` (templated 19 lines with `检查逻辑: N/A`) paired with real `check.py` (44-131 LOC of regex methodology). SKILL.md is supposed to be the human-readable methodology document. A reader scanning the rule folder for "what does this verify and why" gets nothing. The agent put all the methodology into `check.py` comments, which works for the engine but loses the deliverable framing.
+ **Variant 1**: stub `SKILL.md` (a templated ~20-line scaffold with `检查逻辑: N/A` or equivalent) paired with a real `check.py` (regex methodology embedded in code). SKILL.md is supposed to be the human-readable methodology document. A reader scanning the rule folder for "what does this verify and why" gets nothing. The methodology has been pushed entirely into `check.py` comments, which works for the engine but loses the deliverable framing.

- **Variant 2 (v0.7.5 资管 audit § 3.4)**: substantive `SKILL.md` (real methodology, PASS/FAIL criteria, regulation cross-refs) paired with stub `check.py` (29-line scaffold returning `{"verdict": "NOT_APPLICABLE", "evidence": "Check requires worker LLM execution"}`). The real check logic lives in `workflows/<rule_id>/workflow.py` — but `check.py` doesn't import or call it. A user running `python rule_skills/R01-01/check.py document.txt` gets `NOT_APPLICABLE` on every input, which is misleading.
+ **Variant 2**: substantive `SKILL.md` (real methodology, PASS/FAIL criteria, source cross-refs) paired with stub `check.py` (a thin scaffold returning `{"verdict": "NOT_APPLICABLE", "evidence": "Check requires worker LLM execution"}`). The real check logic lives in `workflows/<rule_id>/workflow.py` — but `check.py` doesn't import or call it. A user running `python rule_skills/R01-01/check.py document.txt` gets `NOT_APPLICABLE` on every input, which is misleading.

- **Variant 3 (legacy v0.7.0)**: stub `check.py` returning `{"pass": null, "method": "stub"}` paired with otherwise-real SKILL.md. Methodology described but never executable.
+ **Variant 3 (legacy)**: stub `check.py` returning `{"pass": null, "method": "stub"}` paired with an otherwise-real SKILL.md. Methodology described but never executable.
+
+ **Variant 4 (the "monolithic verify engine" stub)**: per-rule SKILL.md is a thin 20-35 line scaffold ("see verify_engine.py"), per-rule check.py is a thin shim that imports + calls a single monolithic `rule_skills/verify_engine.py` (or similar root-level file) that holds all 15-20 rules' verification logic in one ~750-LOC file. Each per-rule check.py looks like `from rule_skills.verify_engine import check_R01_01; return check_R01_01(doc)`. This passes the "check.py is not literally a stub" surface check but inverts the canonical per-rule granularity: per-rule files contain no rule-specific reasoning, the monolith holds everything, and the read-this-skill-to-understand-this-rule workflow fails for everyone (developer user, future auditor, downstream agent). The contract says skills are KC's unit of per-rule granularity for a reason. Centralizing all check logic into one big file may look efficient but loses the per-rule auditability that's the whole point. If you find yourself writing `verify_engine.py` with 15+ check functions and stub SKILL.md/check.py per rule, stop — keep the methodology in each rule's SKILL.md (substantive) and either inline the check logic or use the canonical per-rule check.py + workflow_v1.py pattern.

**The contract**:
- - ✓ DO: SKILL.md describes WHAT to check + WHY + WHEN to flag it. Substantive — typically 50-300 lines, not 19.
+ - ✓ DO: SKILL.md describes WHAT to check + WHY + WHEN to flag it. Substantive — typically 50-300 lines, not a 20-line template.
- ✓ DO: check.py implements the check. EITHER substantive direct logic OR `from workflows.<rule_id>.workflow_v1 import verify` + delegate. Returns concrete verdicts.
- ✗ DON'T: stub SKILL.md with methodology in check.py comments (variant 1).
- ✗ DON'T: substantive SKILL.md with check.py that returns NOT_APPLICABLE without delegating to a workflow (variant 2).
- ✗ DON'T: stub check.py returning null verdict (variant 3, legacy).

- A future engine milestone check (v0.8 P2-F) may refuse phase advance if too many check.py files are stub-shaped. Better to author them substantively now.
+ The engine may refuse phase advance if too many check.py files are stub-shaped. Better to author them substantively now.
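A minimal sketch of a contract-satisfying `check.py` using substantive direct logic (the first DO). The rule, regex, and evidence strings are illustrative, not a real KC rule; the delegating variant would import `verify` from this rule's workflow instead.

```python
#!/usr/bin/env python3
"""rule_skills/R01-01/check.py, sketched with direct logic.

The delegating variant of the contract's DO would replace the regex with
    from workflows.R01_01.workflow_v1 import verify
and return verify(document_text)'s verdict unchanged.
"""
import json
import re
import sys

# Hypothetical rule for illustration: the document must disclose an
# annualized interest rate.
ANNUAL_RATE = re.compile(r"年化利率|annualized rate", re.IGNORECASE)

def check(document_text: str) -> dict:
    hit = ANNUAL_RATE.search(document_text)
    if hit:
        return {"verdict": "PASS", "evidence": hit.group(0)}
    # Concrete FAIL with evidence: never a default NOT_APPLICABLE.
    return {"verdict": "FAIL", "evidence": "required rate disclosure not found"}

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        print(json.dumps(check(f.read()), ensure_ascii=False))
```

Either way, `python rule_skills/R01-01/check.py document.txt` prints a concrete verdict, which is exactly what the stub variants above fail to do.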

## Writing SKILL.md

@@ -139,7 +141,7 @@ def check(document_text):

Recognized prefixes (Chinese + English variants): 预期命中点, 预期结果, 预期判定, 预期验证, 标注, 审核标注, Expected, expected, EXPECTED, Annotation, annotation. Pass `extra_prefixes=("...", "...")` if your project uses different labels.

- E2E #11 贷款 v0.8 audit: 4/14 rules had standalone check.py false-positive PASS on violation samples because they matched the `预期命中点: ...年化利率` footer instead of the document body. v0.8.1 ships the helper as a template file so this trap is one import away from being avoided.
+ A recurring failure mode worth flagging: a non-trivial fraction of rules have standalone check.py false-positive PASS on violation samples because the regex matches the `预期命中点: ...` annotation footer instead of the document body. KC ships the helper as a template file so this trap is one import away from being avoided.
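The trap can be closed in a few lines. This section doesn't name the shipped template helper, so `strip_annotations` below is an illustrative stand-in with the same job: drop annotation footer lines before any matching runs, so a violation sample's own "expected hit" label cannot produce a false PASS.

```python
import re

# Illustrative subset of the recognized prefixes listed above.
DEFAULT_PREFIXES = ("预期命中点", "预期结果", "预期判定",
                    "Expected", "expected", "Annotation")

def strip_annotations(text: str, extra_prefixes=()) -> str:
    # Remove lines like "预期命中点: ..." or "Expected: ..." (ASCII or
    # full-width colon) so check logic only ever sees the document body.
    prefixes = DEFAULT_PREFIXES + tuple(extra_prefixes)
    footer = re.compile(r"^\s*(?:%s)\s*[:：]" % "|".join(map(re.escape, prefixes)))
    return "\n".join(line for line in text.splitlines() if not footer.match(line))
```

Call it first inside `check()` so every downstream regex and keyword test works on the body only.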

## Writing References

@@ -157,6 +159,48 @@ Keep references factual and sourced. They are evidence, not instructions.
- **samples.json**: Annotated examples. Each entry: the input (extracted text or entity), the expected result (pass/fail/missing), and the expected comment. Build this incrementally as you test.
- **corner_cases.json**: Edge cases that the standard logic does not handle. Each entry: description, detection pattern, resolution, and confidence threshold. See the `corner-case-management` skill for the methodology.

+ ## Authoring methodology (from skill-creator core)
+
+ This section folds in the universal authoring patterns from Anthropic's upstream `skill-creator`. Apply them on top of the KC-specific layout above when drafting any new rule skill.
+
+ ### Capture intent before drafting
+
+ Before writing any skill, get clear on four questions:
+
+ 1. **What should this skill enable Claude (or check.py) to do?** — A single concrete capability, not a category.
+ 2. **When should it trigger?** — What user phrases / document contexts should match its description.
+ 3. **What's the expected output format?** — verdict + comment + evidence shape; or for non-check skills, the deliverable shape.
+ 4. **Do we need test samples?** — If the rule has objective pass/fail criteria (almost all KC rules do), yes. Build `assets/samples.json` incrementally as edge cases appear.
+
+ If the conversation already contains worked examples (a user pointed at a passage and said "this is non-compliant"), extract answers from history first — don't ask the user to repeat themselves.
+
+ ### Frontmatter and progressive disclosure
+
+ Skills load in three tiers; budget each tier for what it has to carry:
+
+ 1. **Metadata (name + description)** — always in the agent's context. ~100 words. This is the *primary triggering mechanism* — make it specific and slightly "pushy" (Claude tends to under-trigger skills). Include trigger keywords, rule ID, the regulation it derives from, and the document location it expects to find evidence in.
+ 2. **SKILL.md body** — loaded when the skill triggers. Target 100–300 lines for typical rules, hard ceiling 500. Explain the WHY behind the rule, not just the mechanics.
+ 3. **Bundled resources** (scripts/, references/, assets/) — loaded only when the body explicitly points to them. Big regulation excerpts and sample corpora belong here, not inline.
+
+ If SKILL.md is approaching 500 lines, that's the cue to push detail down into `references/` and leave a pointer like "See references/edge-cases.md for the full enumeration of corner cases."
+
+ ### Writing style
+
+ - **Imperative over passive.** "Extract the ratio" not "the ratio should be extracted."
+ - **Explain why, not just what.** Today's LLMs have good theory of mind — a one-line "the regulation flags this to protect retail investors" makes a downstream agent generalize correctly to a case you didn't enumerate.
+ - **Be wary of all-caps MUSTs and NEVERs.** If you find yourself reaching for them, that's usually a sign the underlying reasoning hasn't been made explicit. Reframe and explain.
+ - **Be specific about location.** "Look in Chapter 2, Section 'Key Regulatory Metrics' or the summary table on page 1" beats "look in the financial disclosures somewhere."
+
+ ### Test samples before scripts
+
+ After drafting SKILL.md, write 2–3 realistic sample inputs into `assets/samples.json` with their expected verdicts BEFORE you finalise `check.py`. The samples ground the script: every regex or keyword you add should be there to make a specific sample produce its expected verdict. Writing scripts in the abstract — without samples — almost always produces over-fitted code that fails on the first real document.
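The samples-first loop is easy to mechanize. A hedged sketch of a tiny harness; the per-entry shape is an assumption built from the assets description above (input, expected result, expected comment), and the path is illustrative:

```python
import json
from pathlib import Path

def run_samples(check, samples_path="assets/samples.json"):
    """Run this rule's check() over every annotated sample; return mismatches.
    Assumed sample shape, per the assets convention:
    {"input": ..., "expected": {"verdict": ...}, "comment": ...}"""
    samples = json.loads(Path(samples_path).read_text(encoding="utf-8"))
    failures = []
    for i, sample in enumerate(samples):
        got = check(sample["input"])["verdict"]
        want = sample["expected"]["verdict"]
        if got != want:
            failures.append((i, want, got, sample.get("comment", "")))
    return failures  # empty list: every sample produced its expected verdict
```

Run it after every edit to `check.py`; a non-empty return tells you exactly which annotated case regressed.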
+
+ ### Iterate, don't perfect
+
+ A rule skill rarely lands correctly on the first draft. Plan for at least one revision after testing surfaces problems. Don't pile on defensive MUSTs to handle every edge case — generalize the methodology. If three samples each needed a different one-off fix, that's a signal the underlying rule statement is too narrow.
+
+ If your skill needs more sophisticated methodology than this section covers — formal eval loops with quantitative benchmarks, blind A/B comparison between skill versions, or description-optimization runs — consult `skill-creator`.
+

## Iteration

Skills evolve through testing. After each test iteration:
@@ -1,10 +1,22 @@
---
name: skill-creator
tier: meta
- description: Anthropic's skill-scaffolding toolkit use for iterating/improving existing skills or running evals on them, NOT as the primary reference for building KC's per-rule verification skills. For KC rule skills, consult `skill-authoring` first (canonical folder layout + granularity rules + KC-specific check.py entry-point conventions) and `work-decomposition` for ordering + grouping decisions. This skill applies once per-rule skills exist and the agent wants to optimize their description/triggering or run formal evals.
+ description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
---

- # Skill Creator
+ # Skill creator (KC fallback)
+
+ This is the upstream Anthropic `skill-creator` methodology, vendored into
+ KC as a deep reference for sophisticated skill authoring. KC's primary
+ skill-authoring path is the `skill-authoring` skill (which inherits the
+ core ideas from this skill). Only consult `skill-creator` directly when
+ `skill-authoring`'s guidance feels insufficient for the complexity of the
+ skill you need to write — e.g., when you want to run formal eval loops,
+ benchmark with variance analysis, or optimise the trigger description.
+ For ordering and grouping decisions about which skills to author first,
+ see `work-decomposition`.
+
+ ---

A skill for creating new skills and iteratively improving them.

@@ -392,7 +404,7 @@ Use the model ID from your system prompt (the one powering the current session)

While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.

- This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude with extended thinking to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.
+ This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.

### How skill triggering works

@@ -436,6 +448,11 @@ In Claude.ai, the core workflow is the same (draft → test → review → impro

**Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.

+ **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. In this case:
+ - **Preserve the original name.** Note the skill's directory name and `name` frontmatter field -- use them unchanged. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`).
+ - **Copy to a writeable location before editing.** The installed skill path may be read-only. Copy to `/tmp/skill-name/`, edit there, and package from the copy.
+ - **If packaging manually, stage in `/tmp/` first**, then copy to the output directory -- direct writes may fail due to permissions.
+

---

## Cowork-Specific Instructions
@@ -448,6 +465,7 @@ If you're in Cowork, the main things to know are:
- Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download `feedback.json` as a file. You can then read it from there (you may have to request access first).
- Packaging works — `package_skill.py` just needs Python and a filesystem.
- Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
+ - **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. Follow the update guidance in the Claude.ai section above.

---

@@ -478,3 +496,7 @@ Repeating one more time the core loop here for emphasis:
Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" in your TodoList to make sure it happens.

Good luck!
+
+ ---
+
+ *Methodology adapted from [anthropics/skills](https://github.com/anthropics/skills) (Apache 2.0).*
@@ -70,7 +70,7 @@ If you do escalate to LLM:
- **tier2-3**: bulk extraction with simple semantic checks
- **tier4** (cheapest): high-volume keyword-spotting that regex can't handle. Note: tier4 models on SiliconFlow are Qwen3.5 thinking-mode — `content` can return empty if `reasoning_content` consumes max_tokens. Test with realistic prompts before relying. If you see empty responses, either bump max_tokens to ≥8192, shorten your prompt, or fall back to tier1-2.
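The empty-`content` trap above is worth guarding in code. A hedged sketch, assuming the provider returns an OpenAI-compatible chat-completion response (the field names below follow that shape; the guard itself is illustrative):

```python
def extract_text(resp: dict, min_chars: int = 1) -> str:
    """Return the worker's answer text, treating empty `content` (e.g. when
    reasoning_content consumed max_tokens on a thinking-mode model) as a
    retryable failure instead of an empty verdict."""
    message = resp["choices"][0]["message"]
    text = (message.get("content") or "").strip()
    if len(text) >= min_chars:
        return text
    raise ValueError(
        "empty content: bump max_tokens to >=8192, shorten the prompt, "
        "or fall back to tier1-2")
```

Raising here keeps the failure visible to the caller's retry/escalation logic rather than silently recording an empty verdict.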

- Both v0.7.1 audit conductors (DS and GLM) defaulted to all-regex distillation and only added LLM escalation when the human user explicitly asked for "V2 with worker LLM". If your rule catalog has any rules where the verification is genuinely semantic, you should reach for `worker_llm_call` yourself — don't wait to be asked.
+ A recurring failure pattern worth flagging: agents default to all-regex distillation and only add LLM escalation when the human user explicitly asks for a "V2 with worker LLM". If your rule catalog has any rules where the verification is genuinely semantic, you should reach for `worker_llm_call` yourself — don't wait to be asked.

## Workflow Structure

@@ -148,6 +148,17 @@ All numbers here (10 documents, 5 percentage points, etc.) are recommended start

This follows the same tier-transition framework as parser escalation in `document-parsing`: a quality/accuracy score drives the decision to stay, escalate, or skip.

+ ### Picking the model inside a tier — quick reference
+
+ The tier framework above answers "which tier is right for this step?". Inside a tier slot, "which specific model?" still matters. A few heuristics that hold today (refresh from `auto-model-selection` — specifics change in months, not years):
+
+ - **Tier 1 / Tier 2 worker workhorse**: the current-generation flagship MoE LLMs (200-400B total / ~20B activated experts) are a reasonable starting baseline. Qwen's family flagship and DeepSeek's current premium model are both in this shape; either works.
+ - **Tier 3 / Tier 4 small model**: prefer Qwen family for sub-30B options — many cheap, reliable choices. Skip the `coder` / `code` named variants at small sizes (unreliable for general worker tasks). Prefer no-thinking-mode variants when available; these tasks don't benefit from reflection.
+ - **Provider stacking**: routing conductor and worker through different providers can isolate per-model throttle / rate-limit exposure (e.g., DeepSeek for workers, SiliconFlow for conductor).
+ - **VLM / OCR**: characters / handwriting / seals → dedicated OCR model (Paddle-OCR, GLM-OCR, DeepSeek-OCR or their successors). Complex graphs / tables → larger general VLM.
+
+ For up-to-date facts (exact model names, context windows, pricing), consult `auto-model-selection` and use Context7. The heuristics above go stale fast — the *shape* (MoE flagship for workhorse, sub-30B non-thinking for cheap bulk, OCR-specific for chars) is what stays.
+

## Testing Against Ground Truth

The coding agent's skill-based results are the ground truth. For each document in Samples/:
@@ -190,7 +201,7 @@ See `references/worker-llm-catalog.md` for current model capabilities and contex

## Two access paths: `worker_llm_call` tool (preferred) vs direct HTTP

- KC ships a `worker_llm_call` tool. Use it whenever possible — the engine sees every call, can track cost + token spend, applies rate limiting, and surfaces in audit. v0.8 P2-B added a batch mode:
+ KC ships a `worker_llm_call` tool. Use it whenever possible — the engine sees every call, can track cost + token spend, applies rate limiting, and surfaces in audit. A batch mode is supported:

```
worker_llm_call({
@@ -203,7 +214,7 @@ worker_llm_call({
Returns a `{n_total, n_succeeded, n_failed, total_tokens_in, total_tokens_out, results: [...]}` summary. Partial failures don't fail the whole batch.


- ### The canonical `workflows/common/llm_client.py` (v0.8.1 — ship from template)
+ ### The canonical `workflows/common/llm_client.py` (shipped from template)

For a workflow that runs **standalone** (no KC session — e.g., a customer deploys the release bundle and runs `python run.py doc.pdf`), the workflow has no access to `worker_llm_call`. The canonical HTTP client shim ships as a template file and is auto-populated into every workspace's `workflows/common/llm_client.py` at engine init. **Do not write your own.** Use the file that's already there:

@@ -226,7 +237,15 @@ What the shim does:
- Writes a line to `output/llm_ledger.jsonl` per call so KC audits can reconstruct cost even when worker_llm_call wasn't used
- Raises an explicit error if `LLM_BASE_URL` is missing (no silent fallback to a hardcoded vendor)

- **Don't write your own llm_client.py from scratch.** Three v0.7.x/v0.8 sessions in a row had agents roll their own shim — buggy (stale model IDs, hardcoded SiliconFlow URL, no ledger) and invisible to the engine. Use the canonical shim; if it's missing for some reason, copy it from `template/workflows/common/llm_client.py` in the kc-beta install (the engine also auto-populates at init — check `workflows_common_populated` event in events.jsonl).
+ **Don't write your own llm_client.py from scratch.** A recurring failure mode worth flagging: agents repeatedly roll their own shim — buggy (stale model IDs, hardcoded vendor URL, no ledger) and invisible to the engine. Use the canonical shim; if it's missing for some reason, copy it from `template/workflows/common/llm_client.py` in the kc-beta install (the engine also auto-populates at init — check `workflows_common_populated` event in events.jsonl).
+
+ ### Worse anti-pattern: writing a parallel hand-rolled client AT THE SAME TIME as the canonical one
+
+ A persistent failure pattern across runs: the canonical `workflows/common/llm_client.py` is present (engine auto-populated it at init), AND the agent ALSO writes its own `workflows/llm_client.py` or `verify_engine_v2.py` with a `requests.post(...)` HTTP call. The agent then uses the hand-rolled one for all the real LLM work. Both files sit in the workspace. Engine cost-tracking sees nothing.
+
+ When this happens, three things go wrong: (1) Provider routing breaks. The hand-rolled client typically reads `LLM_BASE_URL` from workspace `.env` — which is the **conductor**'s endpoint. KC's worker routing (via `worker_llm_call` and the engine's worker_* config) gets bypassed entirely. If the operator configured workers to a separate provider (say, DeepSeek workers + SiliconFlow conductor), the hand-rolled client wrongly hits the conductor's provider with the worker's model names — producing 400s or, worse, silent wrong-model results. (2) Cost / audit visibility is lost. The engine doesn't see the calls; neither does `output/llm_ledger.jsonl` (the canonical client's ledger isn't written by the hand-rolled one). The session looks like it did zero LLM work, but the actual LLM bill exists. (3) Rate-limit / retry / timeout behavior diverges. The canonical client + worker_llm_call inherit engine-level resilience patterns (AbortSignal.timeout, withRetry on 429/5xx, etc.). A hand-rolled `requests.post` has none of that — it stalls, throws ad-hoc errors, or silently corrupts the run.
+
+ **Rule of thumb**: if the verification rule needs an LLM judgment, your two valid options are `worker_llm_call` (when running inside a KC session) or `from workflows.common.llm_client import call` (when running standalone from a release bundle). If you find yourself typing `import requests` or `urllib.request.urlopen` for an LLM call, stop. That code path will be flagged in audit as a recurring adoption miss and rewritten — save the round trip and use the right tool the first time.
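To make the divergence in (3) concrete, here is the kind of transient-failure handling the canonical paths give you for free and a bare `requests.post` does not. A minimal sketch: the names are illustrative and this is not the shim's actual API, only the retry pattern it embodies.

```python
import time

def with_retry(call, max_attempts=3, base_delay=1.0,
               retryable=(429, 500, 502, 503)):
    """Exponential backoff on transient HTTP statuses, the kind of
    resilience worker_llm_call and the canonical shim provide.
    `call` is a zero-arg function returning (status, body)."""
    status, body = call()
    for attempt in range(1, max_attempts):
        if status not in retryable:
            return status, body
        time.sleep(base_delay * (2 ** (attempt - 1)))  # 1s, 2s, 4s, ...
        status, body = call()
    return status, body
```

A hand-rolled client without this (and without a timeout budget) turns every provider hiccup into a stalled or corrupted run.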

## sandbox_exec timeout for known-slow commands

@@ -126,7 +126,7 @@ Five failure modes recur across projects. Learn to recognize them early.

## Integration

- Task decomposition sits between rule extraction and skill authoring in the KC Reborn lifecycle. It is the bridge that translates abstract rules into concrete implementation plans.
+ Task decomposition sits between rule extraction and skill authoring in the KC lifecycle. It is the bridge that translates abstract rules into concrete implementation plans.

**Input**: A rule catalog from `rule-extraction`. Each rule is an atomic, testable verification requirement. If a rule is not yet atomic, send it back to rule extraction for further decomposition before attempting task decomposition.