kc-beta 0.8.1 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (63)
  1. package/package.json +1 -1
  2. package/src/agent/context.js +17 -1
  3. package/src/agent/engine.js +85 -8
  4. package/src/agent/llm-client.js +24 -1
  5. package/src/agent/pipelines/_milestone-derive.js +78 -7
  6. package/src/agent/pipelines/skill-authoring.js +19 -2
  7. package/src/agent/tools/release.js +94 -1
  8. package/src/cli/index.js +28 -7
  9. package/template/.env.template +1 -1
  10. package/template/AGENT.md +2 -2
  11. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  12. package/template/skills/en/bootstrap-workspace/SKILL.md +13 -0
  13. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  14. package/template/skills/en/confidence-system/SKILL.md +30 -8
  15. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  16. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  17. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  18. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  19. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  20. package/template/skills/en/document-chunking/SKILL.md +99 -15
  21. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  22. package/template/skills/en/quality-control/SKILL.md +14 -0
  23. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  24. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  25. package/template/skills/en/skill-authoring/SKILL.md +52 -8
  26. package/template/skills/en/skill-creator/SKILL.md +25 -3
  27. package/template/skills/en/skill-to-workflow/SKILL.md +23 -4
  28. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  29. package/template/skills/en/tree-processing/SKILL.md +1 -1
  30. package/template/skills/en/version-control/SKILL.md +15 -0
  31. package/template/skills/en/work-decomposition/SKILL.md +21 -35
  32. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  33. package/template/skills/zh/bootstrap-workspace/SKILL.md +13 -0
  34. package/template/skills/zh/compliance-judgment/SKILL.md +14 -0
  35. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  36. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  37. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  38. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  39. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  40. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  41. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  42. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  43. package/template/skills/zh/document-chunking/SKILL.md +96 -20
  44. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  45. package/template/skills/zh/entity-extraction/SKILL.md +14 -4
  46. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  47. package/template/skills/zh/quality-control/SKILL.md +14 -0
  48. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  49. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  50. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  51. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  52. package/template/skills/zh/skill-authoring/SKILL.md +108 -69
  53. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  54. package/template/skills/zh/skill-creator/SKILL.md +71 -61
  55. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  56. package/template/skills/zh/skill-to-workflow/SKILL.md +24 -5
  57. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  58. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  59. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  60. package/template/skills/zh/tree-processing/SKILL.md +1 -1
  61. package/template/skills/zh/version-control/SKILL.md +15 -0
  62. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  63. package/template/skills/zh/work-decomposition/SKILL.md +21 -33
@@ -16,6 +16,12 @@ Data/entity extraction (`entity-extraction`) is the **repeating task** that runs
 
 Don't conflate the two. Rule extraction happens once; data extraction happens on every document.
 
+ ## Source-first sequencing
+
+ Extract rules from the source text FIRST. Only after you have a complete first-pass catalog from sources alone should you open sample documents. The temptation is to peek at samples early to "see what kinds of rules matter" — this biases you toward rules the samples happen to exercise and silently drops rules the samples don't cover.
+
+ A domain professional reads the source material, builds an understanding, then validates on samples — not the reverse. KC's differentiator over general-purpose agents is systematic accuracy across long context; that advantage compounds when you ground in the SOURCE, not the EXAMPLES.
+
 ## Rule Structure: Location → Extraction → Judgment
 
 Every verification rule decomposes into three parts:
@@ -62,22 +68,17 @@ When rules change (additions, modifications, deprecations), version the entire r
 
 ## Granularity Calibration (read before extracting)
 
- A well-extracted rule catalog has **10-20 rules per typical regulation PDF**
- (a 30-80 page disclosure regulation). Over-extraction into 60-100 rules per
- regulation signals you're treating every clause as its own rule; downstream
- consumers (skill-authoring, workflow-run) can't distinguish meaningful
- checks from boilerplate.
-
- If your first pass produces more than ~25 rules for a single regulation:
- - **Merge rules that share evidence and fail together** (e.g., "must
-   disclose X" and "must disclose Y" where both come from the same
-   required-fields table → one rule: "must disclose the required-fields
-   list including X, Y").
- - **Drop procedural language** that isn't checkable against a report
-   (definitions, scope statements, references to other regs that just
-   transitively apply).
- - **Keep only checkable obligations, prohibitions, and thresholds** —
-   things where you can read a sample report and say pass or fail.
+ Rule catalogs come from diverse source materials — formal regulations, internal handbooks, case law, legal opinions, expert rule tables, regulator Q&A. There is no universal "right number of rules per page". Calibrate by logic, not by count:
+
+ - **Atomicity is the real test.** A rule that can produce two independent pass/fail outcomes is two rules. A rule whose verdict requires verifying three different paragraphs of the source is probably three rules.
+ - **Boilerplate is not a rule.** Definitions, scope statements, transitive references to other regulations, and procedural language that can't be checked against the target document do not become rules.
+ - **Keep only checkable obligations, prohibitions, and thresholds** — things where you can read a target document and say pass / fail / not-applicable.
+
+ If your first pass feels too coarse (one rule per chapter, ignoring multiple distinct obligations within) — go finer. If it feels too fine (every clause in a definitions section is its own rule) — merge or drop. Then:
+
+ - **Merge rules that share evidence and fail together** (e.g., "must disclose X" and "must disclose Y" where both come from the same required-fields table → one rule: "must disclose the required-fields list including X, Y").
+ - **Drop procedural language** that isn't checkable against a target document.
+ - **Convert each surviving rule into a falsifiability statement** — if you can't state precisely what would make it fail, you don't have a rule yet.
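The calibration checks above can be sketched as a quick lint pass over a candidate rule. A minimal sketch, assuming illustrative field names (`falsifiability_statement`, `obligations`, `notes`) rather than the actual rule_catalog schema:

```python
# Hypothetical rule-entry shape; field names are illustrative, not the
# actual rule_catalog schema.
def falsifiability_gaps(rule):
    """Return reasons a rule is not yet an atomic, checkable rule."""
    gaps = []
    if not rule.get("falsifiability_statement"):
        gaps.append("no statement of what would make the rule FAIL")
    # Vague wording usually means a threshold got summarized away.
    vague = ("reasonable", "timely", "appropriate", "及时")
    if any(w in rule.get("description", "") for w in vague) and not rule.get("notes"):
        gaps.append("vague term without an explicit ambiguity note")
    # A rule carrying several independent obligations should be split.
    if len(rule.get("obligations", [])) > 1:
        gaps.append("compound rule: one entry per pass/fail outcome")
    return gaps

rule = {"id": "R014",
        "description": "Disclose within a reasonable period",
        "obligations": ["disclose"]}
print(falsifiability_gaps(rule))
```

A rule that returns an empty list here is at least structurally ready for downstream skill-authoring; the semantic judgment still rests with you.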
 
 ### Sample "good" rule
 
@@ -94,104 +95,58 @@ If your first pass produces more than ~25 rules for a single regulation:
 }
 ```
 
- Note: one pass/fail outcome, a single `source_ref` to a specific clause,
- clear applicability scope. Skill-authoring can write `check_r014.py` from
- this alone.
+ Note: one pass/fail outcome, a single `source_ref` to a specific clause, clear applicability scope. Skill-authoring can write `check_r014.py` from this alone.
 
- ### Cross-regulation dedup (when working across multiple PDFs)
+ ### Cross-source dedup (when working across multiple documents)
 
- If the developer user provides N regulations, rules from later regs often
- duplicate cross-cutting requirements already captured by earlier ones
- (e.g., a 2018 generic disclosure rule vs. a 2025 specific version).
- Before emitting a rule from reg N:
+ If the developer user provides N source documents, rules from later sources often duplicate cross-cutting requirements already captured by earlier ones (e.g., a generic disclosure rule from an older regulation vs. a newer specific version of the same obligation). Before emitting a rule from source N:
 
- 1. **Check the existing catalog.** Use `rule_catalog` (operation: list)
-    to see what's already there. Skip if a rule with equivalent scope +
-    intent exists.
+ 1. **Check the existing catalog.** Use `rule_catalog` (operation: list) to see what's already there. Skip if a rule with equivalent scope + intent exists.
 2. **Prefer the newer / more specific source_ref** when rules overlap.
- 3. **If you merged rules**, record the consolidated sources in
-    `source_ref`: e.g., `"New Reg §15.2 + Old Reg §24"`.
+ 3. **If you merged rules**, record the consolidated sources in `source_ref`: e.g., `"New Reg §15.2 + Old Reg §24"`.
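The three dedup steps can be sketched as one pass over the existing catalog. The entry shapes, `applicability`/`keywords` fields, and the intent heuristic are all illustrative assumptions, not the actual rule_catalog schema:

```python
# Illustrative sketch of the dedup contract; catalog entries and field
# names are assumed, not the actual rule_catalog schema.
def find_duplicate(candidate, catalog):
    """Return an existing rule with equivalent scope + intent, if any."""
    for existing in catalog:
        same_scope = existing["applicability"] == candidate["applicability"]
        # Crude intent match: the two rules share obligation keywords.
        same_intent = set(existing["keywords"]) & set(candidate["keywords"])
        if same_scope and same_intent:
            return existing
    return None

catalog = [{"id": "R001", "applicability": "wealth-products",
            "keywords": ["disclose", "annualized-rate"],
            "source_ref": "Old Reg §24"}]
candidate = {"applicability": "wealth-products",
             "keywords": ["disclose", "annualized-rate"],
             "source_ref": "New Reg §15.2"}
dup = find_duplicate(candidate, catalog)
if dup:
    # Prefer the newer, more specific source_ref; record both when merging.
    dup["source_ref"] = f"{candidate['source_ref']} + {dup['source_ref']}"
print(catalog[0]["source_ref"])  # → New Reg §15.2 + Old Reg §24
```

In practice the "equivalent intent" judgment is semantic, not a keyword intersection; the sketch only fixes the shape of the check-before-emit loop.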
 
 ### Delegation to sub-agents
 
- If you dispatch extraction to sub-agents (one per regulation), the
- sub-agent inherits ONLY its `task_description` — it cannot see your
- conversation or existing catalog. Therefore, when composing the brief:
-
- - **Specify the target count band** explicitly: "Extract 10-20 atomic
-   rules from this regulation."
- - **Include a sample rule** in the brief body (paste the JSON above
-   verbatim) so the sub-agent's calibration matches yours.
- - **Name every regulation the sub-agent should process.** If AGENT.md
-   lists 10 core regulations, the brief must list all 10 by name, not
-   "the core regs" as a pronoun — LLMs composing long structured briefs
-   frequently drop items (observed in session 6304673afaa0 where reg 02
-   was silently omitted).
- - **State the dedup contract**: "Rules already in the parent's catalog
-   (R001–Rnnn) should NOT be re-extracted. If a requirement is already
-   covered, skip it." Then pass the current catalog's ID ranges.
- - **Prefer `rule_catalog` create operations over sandbox_exec writes to
-   catalog.json.** rule_catalog uses workspace file locking;
-   sandbox_exec bypasses it and races with other writers.
-
- ## How to read regulation files (default: read whole)
-
- Regulations are the audit's authoritative basis. Every `source_ref`
- in your extracted rules must be verifiable against the source text.
- For typical regulation documents (a single file under ~50 KB / under
- ~100 pages), **read each regulation file whole using `workspace_file`
- (operation=read) in a single call**:
+ If you dispatch extraction to sub-agents (one per source document), the sub-agent inherits ONLY its `task_description` — it cannot see your conversation or existing catalog. Therefore, when composing the brief:
+
+ - **Anchor calibration with a concrete sample rule.** Paste the JSON above verbatim into the brief body so the sub-agent's atomicity calibration matches yours.
+ - **Name every source document the sub-agent should process.** If AGENT.md lists 10 core source documents, the brief must list all 10 by name, not "the core regs" as a pronoun — LLMs composing long structured briefs frequently drop items silently.
+ - **State the dedup contract**: "Rules already in the parent's catalog (R001–Rnnn) should NOT be re-extracted. If a requirement is already covered, skip it." Then pass the current catalog's ID ranges.
+ - **Prefer `rule_catalog` create operations over sandbox_exec writes to catalog.json.** rule_catalog uses workspace file locking; sandbox_exec bypasses it and races with other writers.
+
+ ## How to read source files (default: read whole)
+
+ Source documents are the catalog's authoritative basis. Every `source_ref` in your extracted rules must be verifiable against the source text. For typical source documents (a single file under ~50 KB / under ~100 pages), **read each source file whole using `workspace_file` (operation=read) in a single call**:
 
 ```js
- workspace_file({ operation: "read", scope: "project", path: "Rules/01_some_regulation.md" })
+ workspace_file({ operation: "read", scope: "project", path: "Rules/01_some_source.md" })
 ```
 
- `workspace_file.read` is capped at 50,000 chars per call, which
- covers virtually every individual regulation document. This is the
- default. **Read every regulation file whole before you start
- extracting rules from any of them.**
+ `workspace_file.read` is capped at 50,000 chars per call, which covers virtually every individual source document. This is the default. **Read every source file whole before you start extracting rules from any of them.**
 
 ### Tool choice — `workspace_file` vs `sandbox_exec`
 
 | Tool | Per-call cap | Use for |
 |---|---:|---|
- | `workspace_file` (read) | 50,000 chars | **full reads of regulation / rule documents** |
+ | `workspace_file` (read) | 50,000 chars | **full reads of source / rule documents** |
 | `sandbox_exec` (cat/head/etc) | 10,000 chars | shell commands, **not** full file reads |
 
- `sandbox_exec` is designed for shell commands; its 10K cap is too
- small for most regulations. `cat rules/01_*.md` returns only the
- first ~10 KB followed by `\n[truncated]`. Re-issuing with `head -N` /
- `tail -M` to scroll the window loses positional precision and burns
- turns. **When you see truncation, don't fight the cap — switch
- tools.**
+ `sandbox_exec` is designed for shell commands; its 10K cap is too small for most regulations. `cat rules/01_*.md` returns only the first ~10 KB followed by `\n[truncated]`. Re-issuing with `head -N` / `tail -M` to scroll the window loses positional precision and burns turns. **When you see truncation, don't fight the cap — switch tools.**
 
- ### Asymmetry — regs read whole, samples sampled
+ ### Asymmetry — sources read whole, samples sampled
 
- Regulations are limited (typically 1-10 files), authoritative, and
- read once. Read every regulation whole.
+ Source documents are limited (typically 1-10 files), authoritative, and read once. Read every source file whole.
 
- Sample documents may number 30 to 1000+, are heterogeneous, and get
- read many times during testing. **Don't try to read every sample
- whole.** Use rule-applicability filters or sampled subsets to focus
- attention.
+ Sample documents may number 30 to 1000+, are heterogeneous, and get read many times during testing. **Don't try to read every sample whole.** Use rule-applicability filters or sampled subsets to focus attention.
 
- ### Escape valve — when a single reg exceeds ~200K chars
+ ### Escape valve — when a single source exceeds ~200K chars
 
- Rare in practice. The largest regulation in `test_data_4` is 42 KB;
- typical Chinese banking regs (资管新规, 信披办法, etc.) all fit
- under 50 KB. But if you do encounter a single regulation so large
- that reading it whole would crowd the context window — heuristic:
- the file exceeds ~200,000 chars or ~25% of your context budget —
- use your own judgment:
+ Rare in practice: most regulation, handbook, or rule-table documents fit comfortably under 50 KB. But if you do encounter a single source document so large that reading it whole would crowd the context window — heuristic: the file exceeds ~200,000 chars or ~25% of your context budget — use your own judgment:
 
- - Read by chapter (e.g., `第X章` / `Chapter X`) using `document_parse`
-   or paginated `workspace_file` reads
- - Or build an in-workspace index file pointing to chapter offsets and
-   read on-demand per rule being extracted
+ - Read by chapter (e.g., `第X章` / `Chapter X`) using `document_parse` or paginated `workspace_file` reads
+ - Or build an in-workspace index file pointing to chapter offsets and read on-demand per rule being extracted
 
- The 50 KB cap is high enough that this almost never triggers. **The
- default is read whole; deviate only when the file genuinely doesn't
- fit.**
+ The 50 KB cap is high enough that this almost never triggers. **The default is read whole; deviate only when the file genuinely doesn't fit.**
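The chapter-offset index mentioned in the escape valve can be built with a plain regex scan. A minimal sketch, assuming the `第X章` / `Chapter X` heading conventions named above actually hold in the corpus:

```python
import re

# Offset index for an oversized source file. The heading patterns mirror
# the skill's examples (第X章 / Chapter X) and are assumptions about the
# corpus, not a guaranteed format.
CHAPTER_RE = re.compile(
    r"^(第[一二三四五六七八九十百]+章.*|Chapter \d+.*)$", re.MULTILINE
)

def chapter_index(text):
    """Map each chapter heading to its character offset, for on-demand reads."""
    return [(m.group(1).strip(), m.start()) for m in CHAPTER_RE.finditer(text)]

doc = "第一章 总则\n...\n第二章 信息披露\n...\nChapter 3 Governance\n..."
for title, offset in chapter_index(doc):
    print(offset, title)
```

Persisting this index in the workspace lets each rule-extraction step read only the chapter its `source_ref` points into, instead of re-reading the whole file.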
 
  ## Extraction Strategies
 
@@ -202,11 +157,14 @@ When the developer user provides rules in xlsx, csv, or a structured document wh
 - Map each row to a rule, preserving the developer user's identifiers.
 - Ask clarifying questions only if entries are ambiguous.
 
- ### Strategy 2: Hierarchical Extraction from Regulation Text
+ ### Strategy 2: Hierarchical Extraction from Source Text
 
- For raw regulation documents (PDF, DOCX, legal text):
+ For raw source documents (PDF, DOCX, legal text, handbooks, case collections):
 
 1. **Survey the document structure.** Read the table of contents or scan headers. Understand the hierarchy: parts, chapters, sections, articles, clauses.
+
+    Before extracting any rule, traverse the table of contents and section headers end-to-end. Sketch the rule-bearing hierarchy: which chapters impose obligations, which are definitions / context. A common failure mode: a long source with many articles yields disproportionately few rules — almost always meaning you stopped surveying after the high-density chapters. Decide your rule-bearing chapter span explicitly, then justify deviations relative to that span rather than to a single global count target.
+
 2. **Identify rule-bearing sections.** Not every section contains a verification rule. Some are definitions, some are procedural, some are context. Focus on sections that impose obligations, prohibitions, thresholds, or requirements.
 3. **Peel the onion.** Start at the highest structural level and work downward:
    - Level 1: What major areas does the regulation cover? (e.g., capital adequacy, risk disclosure, governance)
@@ -216,7 +174,7 @@ For raw regulation documents (PDF, DOCX, legal text):
 4. **Handle cross-references.** Regulations love to say "as defined in Section X" or "subject to the conditions in Article Y." Resolve these by including the referenced content in the rule's description, not just the reference.
 5. **Handle compound rules.** "The report must include (a) risk factors, (b) financial projections, and (c) management discussion" — this is three rules, not one. Decompose unless the developer user specifically wants them grouped.
 
- For long documents (100+ pages), use the onion-peeler approach described in `references/chunking-strategies.md`. Do not try to read the entire document in one pass.
+ For long documents, use the onion-peeler approach; see the `document-chunking` skill for the full strategy and the wedge-driving fallback for sections without clear headers. Do not try to read the entire document in one pass.
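Step 5's compound-rule decomposition can be sketched for the simple `(a)/(b)/(c)` case. The splitting regex, entry shape, and `R021` ID are illustrative, tuned to the example sentence, not a general legal-text parser:

```python
import re

# Hypothetical decomposition of a compound obligation into atomic rules.
# The (a)/(b)/(c) parsing is illustrative; real clauses need judgment.
def decompose_compound(rule_id, text):
    """Split '(a) ... (b) ... (c) ...' style obligations into one rule each."""
    parts = re.split(r"\([a-z]\)\s*", text)
    stem = parts[0].strip()
    # Drop the list glue (trailing commas, ", and") from each item.
    items = [re.sub(r",?\s+and\s*$", "", p).strip(" ,.") for p in parts[1:]]
    return [{"id": f"{rule_id}.{i + 1}", "description": f"{stem} {item}"}
            for i, item in enumerate(items)]

rules = decompose_compound(
    "R021",
    "The report must include (a) risk factors, (b) financial projections, "
    "and (c) management discussion",
)
for r in rules:
    print(r["id"], "-", r["description"])
```

Each resulting entry carries one pass/fail outcome, which is what downstream skill-authoring needs.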
 
 ### Strategy 3: Expert Notes
 
@@ -285,6 +243,8 @@ Do not skip ambiguous rules. They are often the most important ones.
 
 ## Sanity-check applicability against the sample corpus
 
+ > This is a validation pass, not a discovery pass. Do not let 0-sample rules tempt you to delete them at this stage — first ask whether the source requires them; if yes, keep them as "future scope" rather than dropping them.
+
 After extracting your rule catalog and before authoring skills, do this 5-minute check: project each rule's applicability filter against the sample corpus.
 
 For every rule:
@@ -292,14 +252,52 @@ For every rule:
 2. For each rule, count how many samples it would apply to (per the rule's `applicability` field, scope filter, or whatever shape your catalog uses)
 3. Flag rules that apply to **0 samples** — they're either genuinely test-corpus-irrelevant (acceptable) or over-constrained (bug)
 
- E2E #7 GLM produced a 97-rule catalog where 36 rules (37%) had `PASS=0 FAIL=0 NOT_APPLICABLE=90` across all 90 documents; they never fired. Some were legit (rules for cash-management products with no cash-management samples in corpus), but 36 inactive of 97 was high enough to suggest scope-too-narrow drift.
+ A failure mode worth flagging: a catalog where a large fraction of rules (say 30-40%) return `PASS=0 FAIL=0 NOT_APPLICABLE=all` across the entire sample set. Some inactive rules are legitimate (the source requires checks for a product type the corpus doesn't happen to contain), but a high inactive ratio almost always signals scope-too-narrow drift — applicability filters that over-specify.
 
 If many rules are 0-sample, either:
 - **Reframe their applicability** — broaden product types, look for evidence in headers/footers not just body, relax the scope filter
 - **Document them as "future scope"** and remove from this iteration's catalog (still capture them in a `rules/future_scope.md` so they're not forgotten)
 - **Update the test corpus** to include matching samples (work with the developer user)
 
- Catching this in `rule_extraction` is much cheaper than authoring 36 skills that then test as inactive in `skill_testing`. The cheap projection here is worth the time it saves later.
+ Catching this in `rule_extraction` is much cheaper than authoring N skills that then test as inactive in `skill_testing`. The cheap projection here is worth the time it saves later.
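The cheap projection described above can be sketched directly. Rule and sample shapes here are illustrative dicts (a `product_type` field and a set-valued `applicability`), not the actual catalog or corpus schema:

```python
# Minimal sketch of the applicability projection; rules and samples are
# illustrative dicts, not the actual catalog / corpus schema.
def project_applicability(rules, samples):
    """Count matching samples per rule and flag rules that never fire."""
    counts = {}
    for rule in rules:
        counts[rule["id"]] = sum(
            1 for s in samples if s["product_type"] in rule["applicability"]
        )
    inactive = [rid for rid, n in counts.items() if n == 0]
    return counts, inactive

rules = [
    {"id": "R001", "applicability": {"wealth", "fund"}},
    {"id": "R002", "applicability": {"cash-management"}},
]
samples = [{"product_type": "wealth"}, {"product_type": "fund"}]
counts, inactive = project_applicability(rules, samples)
print(counts)    # {'R001': 2, 'R002': 0}
print(inactive)  # ['R002']
```

Every rule in `inactive` then gets the triage above: reframe, document as future scope, or grow the corpus.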
+
+ ## Logic-type taxonomy (coverage diagnostic)
+
+ After first-pass extraction, classify each rule by judgment type:
+
+ - **Threshold** — numeric comparison ("annualized rate ≥ 15.4%")
+ - **Decision-Tree** — multi-branch ("if product type ∈ {A, B} then ...")
+ - **Heuristic** — semantic judgment ("does marketing copy imply principal guarantee")
+ - **Process** — procedural compliance ("published within the required deadline")
+
+ If your catalog is 90% Threshold rules, you have likely missed the semantic / process obligations that don't reduce to a number. Re-survey for those. The four types are roughly comparable in frequency across most rule corpora; a heavy skew is a signal to look again at the chapters or sections you skimmed.
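The skew diagnostic can be sketched as a frequency check over a `logic_type` tag. The field name and the 50% dominance cutoff are assumptions for illustration, not KC-specified values:

```python
from collections import Counter

# Illustrative skew check over per-rule "logic_type" tags; the 50%
# cutoff is an assumed threshold, not a KC-specified number.
TYPES = ("Threshold", "Decision-Tree", "Heuristic", "Process")

def type_skew(rules, max_share=0.5):
    """Return logic types that are missing or over-represented."""
    counts = Counter(r["logic_type"] for r in rules)
    missing = [t for t in TYPES if counts[t] == 0]
    dominant = [t for t in TYPES if counts[t] / len(rules) > max_share]
    return missing, dominant

rules = [{"logic_type": "Threshold"}] * 9 + [{"logic_type": "Heuristic"}]
missing, dominant = type_skew(rules)
print(missing)   # ['Decision-Tree', 'Process']
print(dominant)  # ['Threshold']
```

A non-empty `missing` or `dominant` list is the cue to re-survey the chapters you skimmed, not proof anything is wrong.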
+
+ ## Preserve specifics (anti-summarize)
+
+ When writing a rule's `description` and `falsifiability_statement`, preserve every threshold, percentage, deadline, and named entity from the source. "Disclose within a reasonable period" is a vague rule and will fail downstream — the source almost certainly says "within 15 business days." If the source IS genuinely vague, flag the ambiguity explicitly (e.g., `notes: "source uses '及时'; no numeric deadline"`) rather than smoothing it. Downstream skill-authoring will need the specifics to write check.py logic.
+
+ ## Soft sample-access discipline
+
+ You have unlimited tool access to samples — KC does not cap you. The discipline is procedural: source-extraction phase first, then validation phase. Inside the source-extraction phase, samples are a last-resort reference for clarifying terminology, not a discovery surface. If you find yourself opening sample #3 to figure out what to extract next, you have inverted the methodology — close the sample, return to the source. Acceptable narrow exceptions:
+ - A jargon term in the source needs example resolution
+ - Sanity-checking that a rule's `description` field reads coherently when applied to a real document
+
+ ## Primary vs auxiliary sources — iteration order, NOT coverage breadth
+
+ When the developer user labels some source documents "primary" and others "auxiliary" (or "supplementary", or "secondary"), that distinction is about **iteration order**: do the primary regs deeply first, then come back to the auxiliary ones. It is **NOT** a license to skip the auxiliary regs entirely.
+
+ A recurring failure mode worth flagging: the agent reads "primary 01-02 are the main basis, the rest is auxiliary" and produces 13 rules from regs 01-02 + 2 rules from regs 03-04 + zero rules from regs 05-10. The auxiliary regulations (often 60-90 articles each in compliance domains) almost always contain core obligations the primary regs reference or assume. Extracting nothing from them produces a thin catalog that misses real compliance requirements.
+
+ The right interpretation: the primary regs get the first deep pass, the auxiliary regs get a structural-survey pass at minimum — identify their core obligations and extract those, even if not at the same density as the primary ones. Skipping an 80-article regulation entirely should require an explicit reason in `coverage_audit.md` (e.g., "regulation 05 covers fund operations outside our case scope; explicitly out-of-scope per user discussion"). Silent skipping is the failure mode.
+
+ ## Coverage trace (recommended deliverable)
+
+ After extraction, walk the source document paragraph-by-paragraph and tag each as either:
+
+ - `covered_by: [Rxxx, Ryyy]` — articles whose obligations became one or more rules
+ - `non_checkable: definition | context | cross_ref | scope` — articles excluded with explicit reason
+
+ Write this as `rules/coverage_trace.md` (or a section in `coverage_audit.md`). This is the source-side mirror of the existing sample-side applicability check, and catches the "long source → suspiciously few rules" failure mode directly. Engine derivation can read this trace to validate completeness later.
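The coverage-trace deliverable can be sketched as a small renderer. The article shapes, tag names, and markdown-table layout are illustrative, not a KC-mandated format:

```python
# Sketch of the coverage-trace deliverable; article IDs, tags, and the
# markdown layout are illustrative, not a KC-mandated format.
def write_coverage_trace(articles):
    """Render per-article coverage tags as a markdown table."""
    lines = ["# Coverage trace", "", "| article | status |", "|---|---|"]
    for art in articles:
        if art.get("covered_by"):
            status = "covered_by: [" + ", ".join(art["covered_by"]) + "]"
        else:
            status = "non_checkable: " + art["reason"]
        lines.append(f"| {art['ref']} | {status} |")
    return "\n".join(lines) + "\n"

articles = [
    {"ref": "§3", "covered_by": ["R014"]},
    {"ref": "§1", "reason": "definition"},
]
print(write_coverage_trace(articles))
```

In the workspace the rendered string would be written to `rules/coverage_trace.md` (e.g., via `workspace_file` with operation=write) so later passes can audit it.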
 
 ## When Rules Change
 
@@ -1,80 +1,9 @@
1
- # Chunking Strategies for Long Documents
1
+ # Chunking Strategies
2
2
 
3
- When regulation documents exceed what you can process in a single pass, use these proven strategies to decompose them into manageable chunks while preserving semantic coherence.
3
+ The chunking methodology (onion-peeler + wedge fallback + balance
4
+ heuristics) now lives in the `document-chunking` skill. Consult that
5
+ skill directly when designing chunking for rule extraction or any
6
+ downstream processing.
4
7
 
5
- ## The Onion Peeler (Primary Strategy)
6
-
7
- Hierarchical header-based decomposition. Named because you peel the document layer by layer, from the outermost structure inward.
8
-
9
- ### How It Works
10
-
11
- 1. **Parse the document's header hierarchy.** Identify all headers by level (H1, H2, H3, etc. — or their equivalents in the document's formatting: "Part I", "Chapter 1", "Section 1.1", "Article 1").
12
- 2. **Build a tree.** Each header becomes a node. Content between headers belongs to the nearest preceding header at that level.
13
- 3. **Check sizes.** Walk the tree. If a node's content (including all its children) fits within your processing limit, stop — this node is a chunk.
14
- 4. **Split only when necessary.** If a node exceeds the limit, descend to its children. Only split when a node is too large AND has sub-headers to split on.
15
- 5. **Leaf nodes that are still too large** get handled by the wedge-driving fallback (see below).
16
-
17
- ### Why This Works
18
-
19
- - Respects the document's own semantic structure. A "Chapter 3: Risk Disclosure" chunk contains exactly what the author intended that chapter to contain.
20
- - Minimizes information loss. You never cut in the middle of a thought.
21
- - Produces chunks of varying size — and that is fine. A short chapter is better as one chunk than split into artificial halves.
22
-
23
- ### Pattern Discovery Shortcut
24
-
25
- Before building a full parser, explore several sample documents for structural patterns:
26
- - Do all chapter titles start with "Chapter X" or "第X章"?
27
- - Are sections numbered consistently (1.1, 1.2, 1.3)?
28
- - Are there visual markers (bold text, specific fonts, horizontal rules)?
29
-
30
- If you find consistent patterns, a regex-based splitter is faster and more reliable than LLM-based structure detection. For example:
31
- - `^第[一二三四五六七八九十百]+章` for Chinese chapter headers
32
- - `^Chapter \d+` for English chapter headers
33
- - `^\d+\.\d+` for numbered sections
34
-
35
- Always validate the regex against multiple documents before committing to it.
36
-
37
- ## Wedge Driving (Fallback Strategy)
38
-
39
- For content without clear headers — dense legal text, continuous prose, or leaf nodes from the onion peeler that are still too large.
40
-
41
- ### How It Works
42
-
43
- The algorithm uses a **rolling context window** to process documents of arbitrary length without loading the full text at once.
44
-
45
- **Step 1: Window the content.** Load up to MAX_TOKENS (e.g., 100K tokens — configurable) of the remaining unprocessed text into a window. If the remaining text fits in a single chunk, stop — no further splitting needed.
46
-
47
- **Step 2: Ask an LLM for cut points.** Prompt the LLM to identify 1-3 natural break points within the window where topic or subject changes. For each cut point, the LLM returns:
48
- - `tokens_before`: ~K tokens (default K=50) immediately BEFORE the cut, copied verbatim from the text.
49
- - `tokens_after`: ~K tokens immediately AFTER the cut, copied verbatim.
50
- - `chunk_title`: a 5-10 word title describing the chunk that precedes the cut.
51
-
52
- Using token count (not word count) gives consistent granularity across languages — critical for Chinese text which has no whitespace-delimited words.
53
-
54
- **Step 3: Locate the cuts via fuzzy matching.** The LLM's quoted tokens will not be a perfect match to the source text (minor paraphrasing, whitespace differences, encoding artifacts). Use Levenshtein distance (edit distance) to find the best match:
55
- 1. Search the source text for the position that best matches `tokens_before`. Require at least 70% similarity (similarity = 1 - edit_distance / max_length).
56
- 2. The cut position is immediately after the matched `tokens_before` region.
57
- 3. Verify by checking that `tokens_after` appears near the cut position. If `tokens_after` cannot be matched, fall back to the position derived from `tokens_before` alone.
58
-
59
- **Step 4: Slide and repeat.** Create a chunk from the text before the first confirmed cut. Move the window forward: the new window starts from the last cut point. Repeat until all remaining text fits in a single chunk.
-
- ### Why This Works
-
- - The LLM identifies semantic boundaries, not arbitrary character counts.
- - The LLM never regenerates text — it only quotes positions. No hallucination risk.
- - K-token quoting with Levenshtein matching is language-agnostic. It works for Chinese, English, and mixed-language documents equally well.
- - The rolling window means documents of any length can be processed incrementally — the algorithm is not bounded by context window size.
- - Fuzzy matching handles the inevitable small differences between the LLM's quoted text and the actual source.
-
- ### When to Use
-
- - Only when the onion peeler cannot split further (no sub-headers available).
- - For documents with no structural markup at all.
- - Cost consideration: this requires LLM calls. Use the cheapest model that can identify topic boundaries (often TIER3 or TIER4 is sufficient).
-
- ## Practical Guidelines
-
- - **Chunk size depends on the downstream task.** For rule extraction by the coding agent, chunks can be large (100K+ tokens). For worker LLM processing, chunks must fit in 16K-32K context.
- - **Preserve context.** When splitting, include the parent header chain as context. A chunk from "Part II > Chapter 3 > Section 3.2" should include those headers so downstream processing knows where the content belongs.
- - **Cache the tree.** Once a document's structure is parsed, save the tree. Multiple rules may need content from the same document, and re-parsing is wasteful.
- - **Log your chunking decisions.** Which strategy was used, how many chunks were produced, their sizes. This helps debug downstream issues.
+ This stub remains for legacy references; new content goes to
+ `document-chunking`.
@@ -43,9 +43,9 @@ When grouping, name the file with the explicit range so downstream consumers (wo

### Anti-pattern: the unified runner

- If you find yourself writing a single `unified_qc.py` (or `batch_runner.py`, or `master_check.py`) that handles all 110 rules in one Python file, **stop**. That means your per-rule skills are wrong, not that the architecture is wrong. Fix the skills.
+ If you find yourself writing a single `unified_qc.py` (or `batch_runner.py`, or `master_check.py`) that handles all rules in one Python file, **stop**. That means your per-rule skills are wrong, not that the architecture is wrong. Fix the skills.

- E2E #4 demonstrated the cost: an agent wrote `unified_qc.py` to bypass 110 individual skills it didn't trust. Result was 1,150 errors out of 6,930 production checks (16.6%) and a phase counter stuck in `production_qc` while real work happened in skill_authoring. The unified runner felt productive locally and was a global mistake.
+ A failure pattern worth flagging: an agent writes a unified runner like `unified_qc.py` to bypass individual skills it doesn't trust. The result is cascading errors: a single rule's failure corrupts every other rule's verdict, easily producing 15%+ error rates across thousands of production checks. The unified runner feels productive locally and is a global mistake. It also stalls the phase model: with no individual `check.py` files landing on disk, the engine can't credit the work toward milestone completion.

If individual skills aren't running cleanly, the right response is to identify which ones break and fix them, not consolidate. The whole pipeline (extraction → skill_testing → distillation → production_qc) assumes one rule = one verifiable artifact.

@@ -53,20 +53,22 @@ If individual skills aren't running cleanly, the right response is to identify w

Each rule_skill folder MUST have BOTH a substantive `SKILL.md` AND a substantive `check.py` (or `check.py` that imports + calls a workflow that does the real work). One side being a stub breaks the contract.

- **Variant 1 (v0.7.5 贷款 audit § 9.1)**: stub `SKILL.md` (templated 19 lines with `检查逻辑: N/A`) paired with real `check.py` (44-131 LOC of regex methodology). SKILL.md is supposed to be the human-readable methodology document. A reader scanning the rule folder for "what does this verify and why" gets nothing. The agent put all the methodology into `check.py` comments, which works for the engine but loses the deliverable framing.
+ **Variant 1**: stub `SKILL.md` (a templated ~20-line scaffold with `检查逻辑: N/A` or equivalent) paired with a real `check.py` (regex methodology embedded in code). SKILL.md is supposed to be the human-readable methodology document. A reader scanning the rule folder for "what does this verify and why" gets nothing. The methodology has been pushed entirely into `check.py` comments, which works for the engine but loses the deliverable framing.

- **Variant 2 (v0.7.5 资管 audit § 3.4)**: substantive `SKILL.md` (real methodology, PASS/FAIL criteria, regulation cross-refs) paired with stub `check.py` (29-line scaffold returning `{"verdict": "NOT_APPLICABLE", "evidence": "Check requires worker LLM execution"}`). The real check logic lives in `workflows/<rule_id>/workflow.py` — but `check.py` doesn't import or call it. A user running `python rule_skills/R01-01/check.py document.txt` gets `NOT_APPLICABLE` on every input, which is misleading.
+ **Variant 2**: substantive `SKILL.md` (real methodology, PASS/FAIL criteria, source cross-refs) paired with stub `check.py` (a thin scaffold returning `{"verdict": "NOT_APPLICABLE", "evidence": "Check requires worker LLM execution"}`). The real check logic lives in `workflows/<rule_id>/workflow.py` — but `check.py` doesn't import or call it. A user running `python rule_skills/R01-01/check.py document.txt` gets `NOT_APPLICABLE` on every input, which is misleading.

- **Variant 3 (legacy v0.7.0)**: stub `check.py` returning `{"pass": null, "method": "stub"}` paired with otherwise-real SKILL.md. Methodology described but never executable.
+ **Variant 3 (legacy)**: stub `check.py` returning `{"pass": null, "method": "stub"}` paired with an otherwise-real SKILL.md. Methodology described but never executable.
+
+ **Variant 4 (the "monolithic verify engine" stub)**: per-rule SKILL.md is a thin 20-35 line scaffold ("see verify_engine.py"), per-rule check.py is a thin shim that imports + calls a single monolithic `rule_skills/verify_engine.py` (or similar root-level file) that holds all 15-20 rules' verification logic in one ~750-LOC file. Each per-rule check.py looks like `from rule_skills.verify_engine import check_R01_01; return check_R01_01(doc)`. This passes the "check.py is not literally a stub" surface check but inverts the canonical per-rule granularity: per-rule files contain no rule-specific reasoning, the monolith holds everything, and the read-this-skill-to-understand-this-rule workflow fails for everyone (developer user, future auditor, downstream agent). The contract says skills are KC's unit of per-rule granularity for a reason. Centralizing all check logic into one big file may look efficient but loses the per-rule auditability that's the whole point. If you find yourself writing `verify_engine.py` with 15+ check functions and stub SKILL.md/check.py per rule, stop — keep the methodology in each rule's SKILL.md (substantive) and either inline the check logic or use the canonical per-rule check.py + workflow_v1.py pattern.

**The contract**:
- - ✓ DO: SKILL.md describes WHAT to check + WHY + WHEN to flag it. Substantive — typically 50-300 lines, not 19.
+ - ✓ DO: SKILL.md describes WHAT to check + WHY + WHEN to flag it. Substantive — typically 50-300 lines, not a 20-line template.
- ✓ DO: check.py implements the check. EITHER substantive direct logic OR `from workflows.<rule_id>.workflow_v1 import verify` + delegate. Returns concrete verdicts.
- ✗ DON'T: stub SKILL.md with methodology in check.py comments (variant 1).
- ✗ DON'T: substantive SKILL.md with check.py that returns NOT_APPLICABLE without delegating to a workflow (variant 2).
- ✗ DON'T: stub check.py returning null verdict (variant 3, legacy).

- A future engine milestone check (v0.8 P2-F) may refuse phase advance if too many check.py files are stub-shaped. Better to author them substantively now.
+ The engine may refuse phase advance if too many check.py files are stub-shaped. Better to author them substantively now.
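A minimal sketch of a contract-satisfying `check.py` using substantive direct logic (the first DO). The rule, regex, and evidence strings are illustrative, not a real KC rule; the delegating variant would import `verify` from this rule's workflow instead.

```python
#!/usr/bin/env python3
"""rule_skills/R01-01/check.py, sketched with direct logic.

The delegating variant of the contract's DO would replace the regex with
    from workflows.R01_01.workflow_v1 import verify
and return verify(document_text)'s verdict unchanged.
"""
import json
import re
import sys

# Hypothetical rule for illustration: the document must disclose an
# annualized interest rate.
ANNUAL_RATE = re.compile(r"年化利率|annualized rate", re.IGNORECASE)

def check(document_text: str) -> dict:
    hit = ANNUAL_RATE.search(document_text)
    if hit:
        return {"verdict": "PASS", "evidence": hit.group(0)}
    # Concrete FAIL with evidence: never a default NOT_APPLICABLE.
    return {"verdict": "FAIL", "evidence": "required rate disclosure not found"}

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        print(json.dumps(check(f.read()), ensure_ascii=False))
```

Either way, `python rule_skills/R01-01/check.py document.txt` prints a concrete verdict, which is exactly what the stub variants above fail to do.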

## Writing SKILL.md

@@ -139,7 +141,7 @@ def check(document_text):

Recognized prefixes (Chinese + English variants): 预期命中点, 预期结果, 预期判定, 预期验证, 标注, 审核标注, Expected, expected, EXPECTED, Annotation, annotation. Pass `extra_prefixes=("...", "...")` if your project uses different labels.

- E2E #11 贷款 v0.8 audit: 4/14 rules had standalone check.py false-positive PASS on violation samples because they matched the `预期命中点: ...年化利率` footer instead of the document body. v0.8.1 ships the helper as a template file so this trap is one import away from being avoided.
+ A recurring failure mode worth flagging: a non-trivial fraction of rules have standalone check.py false-positive PASS on violation samples because the regex matches the `预期命中点: ...` annotation footer instead of the document body. KC ships the helper as a template file so this trap is one import away from being avoided.
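The trap can be closed in a few lines. This section doesn't name the shipped template helper, so `strip_annotations` below is an illustrative stand-in with the same job: drop annotation footer lines before any matching runs, so a violation sample's own "expected hit" label cannot produce a false PASS.

```python
import re

# Illustrative subset of the recognized prefixes listed above.
DEFAULT_PREFIXES = ("预期命中点", "预期结果", "预期判定",
                    "Expected", "expected", "Annotation")

def strip_annotations(text: str, extra_prefixes=()) -> str:
    # Remove lines like "预期命中点: ..." or "Expected: ..." (ASCII or
    # full-width colon) so check logic only ever sees the document body.
    prefixes = DEFAULT_PREFIXES + tuple(extra_prefixes)
    footer = re.compile(r"^\s*(?:%s)\s*[:：]" % "|".join(map(re.escape, prefixes)))
    return "\n".join(line for line in text.splitlines() if not footer.match(line))
```

Call it first inside `check()` so every downstream regex and keyword test works on the body only.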

## Writing References

@@ -157,6 +159,48 @@ Keep references factual and sourced. They are evidence, not instructions.
- **samples.json**: Annotated examples. Each entry: the input (extracted text or entity), the expected result (pass/fail/missing), and the expected comment. Build this incrementally as you test.
- **corner_cases.json**: Edge cases that the standard logic does not handle. Each entry: description, detection pattern, resolution, and confidence threshold. See the `corner-case-management` skill for the methodology.

+ ## Authoring methodology (from skill-creator core)
+
+ This section folds in the universal authoring patterns from Anthropic's upstream `skill-creator`. Apply them on top of the KC-specific layout above when drafting any new rule skill.
+
+ ### Capture intent before drafting
+
+ Before writing any skill, get clear on four questions:
+
+ 1. **What should this skill enable Claude (or check.py) to do?** — A single concrete capability, not a category.
+ 2. **When should it trigger?** — What user phrases / document contexts should match its description.
+ 3. **What's the expected output format?** — verdict + comment + evidence shape; or for non-check skills, the deliverable shape.
+ 4. **Do we need test samples?** — If the rule has objective pass/fail criteria (almost all KC rules do), yes. Build `assets/samples.json` incrementally as edge cases appear.
+
+ If the conversation already contains worked examples (a user pointed at a passage and said "this is non-compliant"), extract answers from history first — don't ask the user to repeat themselves.
+
+ ### Frontmatter and progressive disclosure
+
+ Skills load in three tiers; budget each tier for what it has to carry:
+
+ 1. **Metadata (name + description)** — always in the agent's context. ~100 words. This is the *primary triggering mechanism* — make it specific and slightly "pushy" (Claude tends to under-trigger skills). Include trigger keywords, rule ID, the regulation it derives from, and the document location it expects to find evidence in.
+ 2. **SKILL.md body** — loaded when the skill triggers. Target 100–300 lines for typical rules, hard ceiling 500. Explain the WHY behind the rule, not just the mechanics.
+ 3. **Bundled resources** (scripts/, references/, assets/) — loaded only when the body explicitly points to them. Big regulation excerpts and sample corpora belong here, not inline.
+
+ If SKILL.md is approaching 500 lines, that's the cue to push detail down into `references/` and leave a pointer like "See references/edge-cases.md for the full enumeration of corner cases."
+
+ ### Writing style
+
+ - **Imperative over passive.** "Extract the ratio" not "the ratio should be extracted."
+ - **Explain why, not just what.** Today's LLMs have good theory of mind — a one-line "the regulation flags this to protect retail investors" makes a downstream agent generalize correctly to a case you didn't enumerate.
+ - **Be wary of all-caps MUSTs and NEVERs.** If you find yourself reaching for them, that's usually a sign the underlying reasoning hasn't been made explicit. Reframe and explain.
+ - **Be specific about location.** "Look in Chapter 2, Section 'Key Regulatory Metrics' or the summary table on page 1" beats "look in the financial disclosures somewhere."
+
+ ### Test samples before scripts
+
+ After drafting SKILL.md, write 2–3 realistic sample inputs into `assets/samples.json` with their expected verdicts BEFORE you finalise `check.py`. The samples ground the script: every regex or keyword you add should be there to make a specific sample produce its expected verdict. Writing scripts in the abstract — without samples — almost always produces over-fitted code that fails on the first real document.
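The samples-first loop is easy to mechanize. A hedged sketch of a tiny harness; the per-entry shape is an assumption built from the assets description above (input, expected result, expected comment), and the path is illustrative:

```python
import json
from pathlib import Path

def run_samples(check, samples_path="assets/samples.json"):
    """Run this rule's check() over every annotated sample; return mismatches.
    Assumed sample shape, per the assets convention:
    {"input": ..., "expected": {"verdict": ...}, "comment": ...}"""
    samples = json.loads(Path(samples_path).read_text(encoding="utf-8"))
    failures = []
    for i, sample in enumerate(samples):
        got = check(sample["input"])["verdict"]
        want = sample["expected"]["verdict"]
        if got != want:
            failures.append((i, want, got, sample.get("comment", "")))
    return failures  # empty list: every sample produced its expected verdict
```

Run it after every edit to `check.py`; a non-empty return tells you exactly which annotated case regressed.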
+
+ ### Iterate, don't perfect
+
+ A rule skill rarely lands correctly on the first draft. Plan for at least one revision after testing surfaces problems. Don't pile on defensive MUSTs to handle every edge case — generalize the methodology. If three samples each needed a different one-off fix, that's a signal the underlying rule statement is too narrow.
+
+ If your skill needs more sophisticated methodology than this section covers — formal eval loops with quantitative benchmarks, blind A/B comparison between skill versions, or description-optimization runs — consult `skill-creator`.
+

## Iteration

Skills evolve through testing. After each test iteration:
@@ -1,10 +1,22 @@
---
name: skill-creator
tier: meta
- description: Anthropic's skill-scaffolding toolkit use for iterating/improving existing skills or running evals on them, NOT as the primary reference for building KC's per-rule verification skills. For KC rule skills, consult `skill-authoring` first (canonical folder layout + granularity rules + KC-specific check.py entry-point conventions) and `work-decomposition` for ordering + grouping decisions. This skill applies once per-rule skills exist and the agent wants to optimize their description/triggering or run formal evals.
+ description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
---

- # Skill Creator
+ # Skill creator (KC fallback)
+
+ This is the upstream Anthropic `skill-creator` methodology, vendored into
+ KC as a deep reference for sophisticated skill authoring. KC's primary
+ skill-authoring path is the `skill-authoring` skill (which inherits the
+ core ideas from this skill). Only consult `skill-creator` directly when
+ `skill-authoring`'s guidance feels insufficient for the complexity of the
+ skill you need to write — e.g., when you want to run formal eval loops,
+ benchmark with variance analysis, or optimise the trigger description.
+ For ordering and grouping decisions about which skills to author first,
+ see `work-decomposition`.
+
+ ---

A skill for creating new skills and iteratively improving them.

@@ -392,7 +404,7 @@ Use the model ID from your system prompt (the one powering the current session)

While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.

- This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude with extended thinking to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.
+ This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.

### How skill triggering works

@@ -436,6 +448,11 @@ In Claude.ai, the core workflow is the same (draft → test → review → impro

**Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.

+ **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. In this case:
+ - **Preserve the original name.** Note the skill's directory name and `name` frontmatter field -- use them unchanged. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`).
+ - **Copy to a writeable location before editing.** The installed skill path may be read-only. Copy to `/tmp/skill-name/`, edit there, and package from the copy.
+ - **If packaging manually, stage in `/tmp/` first**, then copy to the output directory -- direct writes may fail due to permissions.
+

---

## Cowork-Specific Instructions
@@ -448,6 +465,7 @@ If you're in Cowork, the main things to know are:
- Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download `feedback.json` as a file. You can then read it from there (you may have to request access first).
- Packaging works — `package_skill.py` just needs Python and a filesystem.
- Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
+ - **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. Follow the update guidance in the Claude.ai section above.

---

@@ -478,3 +496,7 @@ Repeating one more time the core loop here for emphasis:
Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" in your TodoList to make sure it happens.

Good luck!
+
+ ---
+
+ *Methodology adapted from [anthropics/skills](https://github.com/anthropics/skills) (Apache 2.0).*
@@ -70,7 +70,7 @@ If you do escalate to LLM:
- **tier2-3**: bulk extraction with simple semantic checks
- **tier4** (cheapest): high-volume keyword-spotting that regex can't handle. Note: tier4 models on SiliconFlow are Qwen3.5 thinking-mode — `content` can return empty if `reasoning_content` consumes max_tokens. Test with realistic prompts before relying. If you see empty responses, either bump max_tokens to ≥8192, shorten your prompt, or fall back to tier1-2.
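The empty-`content` trap above is worth guarding in code. A hedged sketch, assuming the provider returns an OpenAI-compatible chat-completion response (the field names below follow that shape; the guard itself is illustrative):

```python
def extract_text(resp: dict, min_chars: int = 1) -> str:
    """Return the worker's answer text, treating empty `content` (e.g. when
    reasoning_content consumed max_tokens on a thinking-mode model) as a
    retryable failure instead of an empty verdict."""
    message = resp["choices"][0]["message"]
    text = (message.get("content") or "").strip()
    if len(text) >= min_chars:
        return text
    raise ValueError(
        "empty content: bump max_tokens to >=8192, shorten the prompt, "
        "or fall back to tier1-2")
```

Raising here keeps the failure visible to the caller's retry/escalation logic rather than silently recording an empty verdict.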

- Both v0.7.1 audit conductors (DS and GLM) defaulted to all-regex distillation and only added LLM escalation when the human user explicitly asked for "V2 with worker LLM". If your rule catalog has any rules where the verification is genuinely semantic, you should reach for `worker_llm_call` yourself — don't wait to be asked.
+ A recurring failure pattern worth flagging: agents default to all-regex distillation and only add LLM escalation when the human user explicitly asks for a "V2 with worker LLM". If your rule catalog has any rules where the verification is genuinely semantic, you should reach for `worker_llm_call` yourself — don't wait to be asked.

## Workflow Structure

@@ -148,6 +148,17 @@ All numbers here (10 documents, 5 percentage points, etc.) are recommended start

This follows the same tier-transition framework as parser escalation in `document-parsing`: a quality/accuracy score drives the decision to stay, escalate, or skip.

+ ### Picking the model inside a tier — quick reference
+
+ The tier framework above answers "which tier is right for this step?". Inside a tier slot, "which specific model?" still matters. A few heuristics that hold today (refresh from `auto-model-selection` — specifics change in months, not years):
+
+ - **Tier 1 / Tier 2 worker workhorse**: the current-generation flagship MoE LLMs (200-400B total / ~20B activated experts) are a reasonable starting baseline. Qwen's family flagship and DeepSeek's current premium model are both in this shape; either works.
+ - **Tier 3 / Tier 4 small model**: prefer Qwen family for sub-30B options — many cheap, reliable choices. Skip the `coder` / `code` named variants at small sizes (unreliable for general worker tasks). Prefer no-thinking-mode variants when available; these tasks don't benefit from reflection.
+ - **Provider stacking**: routing conductor and worker through different providers can isolate per-model throttle / rate-limit exposure (e.g., DeepSeek for workers, SiliconFlow for conductor).
+ - **VLM / OCR**: characters / handwriting / seals → dedicated OCR model (Paddle-OCR, GLM-OCR, DeepSeek-OCR or their successors). Complex graphs / tables → larger general VLM.
+
+ For up-to-date facts (exact model names, context windows, pricing), consult `auto-model-selection` and use Context7. The heuristics above go stale fast — the *shape* (MoE flagship for workhorse, sub-30B non-thinking for cheap bulk, OCR-specific for chars) is what stays.
+

## Testing Against Ground Truth

The coding agent's skill-based results are the ground truth. For each document in Samples/:
@@ -190,7 +201,7 @@ See `references/worker-llm-catalog.md` for current model capabilities and contex

## Two access paths: `worker_llm_call` tool (preferred) vs direct HTTP

- KC ships a `worker_llm_call` tool. Use it whenever possible — the engine sees every call, can track cost + token spend, applies rate limiting, and surfaces in audit. v0.8 P2-B added a batch mode:
+ KC ships a `worker_llm_call` tool. Use it whenever possible — the engine sees every call, can track cost + token spend, applies rate limiting, and surfaces in audit. A batch mode is supported:

```
worker_llm_call({
@@ -203,7 +214,7 @@ worker_llm_call({
Returns a `{n_total, n_succeeded, n_failed, total_tokens_in, total_tokens_out, results: [...]}` summary. Partial failures don't fail the whole batch.


- ### The canonical `workflows/common/llm_client.py` (v0.8.1 — ship from template)
+ ### The canonical `workflows/common/llm_client.py` (shipped from template)

For a workflow that runs **standalone** (no KC session — e.g., a customer deploys the release bundle and runs `python run.py doc.pdf`), the workflow has no access to `worker_llm_call`. The canonical HTTP client shim ships as a template file and is auto-populated into every workspace's `workflows/common/llm_client.py` at engine init. **Do not write your own.** Use the file that's already there:

@@ -226,7 +237,15 @@ What the shim does:
- Writes a line to `output/llm_ledger.jsonl` per call so KC audits can reconstruct cost even when worker_llm_call wasn't used
- Raises an explicit error if `LLM_BASE_URL` is missing (no silent fallback to a hardcoded vendor)

- **Don't write your own llm_client.py from scratch.** Three v0.7.x/v0.8 sessions in a row had agents roll their own shim — buggy (stale model IDs, hardcoded SiliconFlow URL, no ledger) and invisible to the engine. Use the canonical shim; if it's missing for some reason, copy it from `template/workflows/common/llm_client.py` in the kc-beta install (the engine also auto-populates at init — check `workflows_common_populated` event in events.jsonl).
+ **Don't write your own llm_client.py from scratch.** A recurring failure mode worth flagging: agents repeatedly roll their own shim — buggy (stale model IDs, hardcoded vendor URL, no ledger) and invisible to the engine. Use the canonical shim; if it's missing for some reason, copy it from `template/workflows/common/llm_client.py` in the kc-beta install (the engine also auto-populates at init — check `workflows_common_populated` event in events.jsonl).
+
+ ### Worse anti-pattern: writing a parallel hand-rolled client AT THE SAME TIME as the canonical one
+
+ A persistent failure pattern across runs: the canonical `workflows/common/llm_client.py` is present (engine auto-populated it at init), AND the agent ALSO writes its own `workflows/llm_client.py` or `verify_engine_v2.py` with a `requests.post(...)` HTTP call. The agent then uses the hand-rolled one for all the real LLM work. Both files sit in the workspace. Engine cost-tracking sees nothing.
+
+ When this happens, three things go wrong: (1) Provider routing breaks. The hand-rolled client typically reads `LLM_BASE_URL` from workspace `.env` — which is the **conductor**'s endpoint. KC's worker routing (via `worker_llm_call` and the engine's worker_* config) gets bypassed entirely. If the operator configured workers to a separate provider (say, DeepSeek workers + SiliconFlow conductor), the hand-rolled client wrongly hits the conductor's provider with the worker's model names — producing 400s or, worse, silent wrong-model results. (2) Cost / audit visibility is lost. The engine doesn't see the calls; neither does `output/llm_ledger.jsonl` (the canonical client's ledger isn't written by the hand-rolled one). The session looks like it did zero LLM work, but the actual LLM bill exists. (3) Rate-limit / retry / timeout behavior diverges. The canonical client + worker_llm_call inherit engine-level resilience patterns (AbortSignal.timeout, withRetry on 429/5xx, etc.). A hand-rolled `requests.post` has none of that — it stalls, throws ad-hoc errors, or silently corrupts the run.
+
+ **Rule of thumb**: if the verification rule needs an LLM judgment, your two valid options are `worker_llm_call` (when running inside a KC session) or `from workflows.common.llm_client import call` (when running standalone from a release bundle). If you find yourself typing `import requests` or `urllib.request.urlopen` for an LLM call, stop. That code path will be flagged in audit as a recurring adoption miss and rewritten — save the round trip and use the right tool the first time.
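To make the divergence in (3) concrete, here is the kind of transient-failure handling the canonical paths give you for free and a bare `requests.post` does not. A minimal sketch: the names are illustrative and this is not the shim's actual API, only the retry pattern it embodies.

```python
import time

def with_retry(call, max_attempts=3, base_delay=1.0,
               retryable=(429, 500, 502, 503)):
    """Exponential backoff on transient HTTP statuses, the kind of
    resilience worker_llm_call and the canonical shim provide.
    `call` is a zero-arg function returning (status, body)."""
    status, body = call()
    for attempt in range(1, max_attempts):
        if status not in retryable:
            return status, body
        time.sleep(base_delay * (2 ** (attempt - 1)))  # 1s, 2s, 4s, ...
        status, body = call()
    return status, body
```

A hand-rolled client without this (and without a timeout budget) turns every provider hiccup into a stalled or corrupted run.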

## sandbox_exec timeout for known-slow commands

@@ -126,7 +126,7 @@ Five failure modes recur across projects. Learn to recognize them early.

## Integration

- Task decomposition sits between rule extraction and skill authoring in the KC Reborn lifecycle. It is the bridge that translates abstract rules into concrete implementation plans.
+ Task decomposition sits between rule extraction and skill authoring in the KC lifecycle. It is the bridge that translates abstract rules into concrete implementation plans.

**Input**: A rule catalog from `rule-extraction`. Each rule is an atomic, testable verification requirement. If a rule is not yet atomic, send it back to rule extraction for further decomposition before attempting task decomposition.