kc-beta 0.7.5 → 0.8.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +47 -0
- package/package.json +3 -2
- package/src/agent/context.js +17 -1
- package/src/agent/engine.js +467 -100
- package/src/agent/llm-client.js +24 -1
- package/src/agent/pipelines/_advance-hints.js +92 -0
- package/src/agent/pipelines/_milestone-derive.js +325 -20
- package/src/agent/pipelines/skill-authoring.js +49 -3
- package/src/agent/tools/agent-tool.js +2 -2
- package/src/agent/tools/consult-skill.js +15 -0
- package/src/agent/tools/dashboard-render.js +48 -1
- package/src/agent/tools/document-parse.js +31 -2
- package/src/agent/tools/phase-advance.js +17 -13
- package/src/agent/tools/release.js +343 -7
- package/src/agent/tools/sandbox-exec.js +65 -8
- package/src/agent/tools/worker-llm-call.js +95 -15
- package/src/agent/workspace.js +25 -4
- package/src/cli/components.js +4 -1
- package/src/cli/index.js +125 -8
- package/src/config.js +19 -2
- package/src/marathon/driver.js +217 -0
- package/src/marathon/prompts.js +93 -0
- package/template/.env.template +17 -1
- package/template/AGENT.md +2 -2
- package/template/skills/en/auto-model-selection/SKILL.md +55 -35
- package/template/skills/en/bootstrap-workspace/SKILL.md +27 -0
- package/template/skills/en/compliance-judgment/SKILL.md +14 -0
- package/template/skills/en/confidence-system/SKILL.md +30 -8
- package/template/skills/en/corner-case-management/SKILL.md +53 -33
- package/template/skills/en/cross-document-verification/SKILL.md +88 -83
- package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
- package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
- package/template/skills/en/data-sensibility/SKILL.md +19 -12
- package/template/skills/en/document-chunking/SKILL.md +99 -15
- package/template/skills/en/entity-extraction/SKILL.md +14 -4
- package/template/skills/en/quality-control/SKILL.md +23 -0
- package/template/skills/en/rule-extraction/SKILL.md +92 -94
- package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
- package/template/skills/en/skill-authoring/SKILL.md +85 -2
- package/template/skills/en/skill-creator/SKILL.md +25 -3
- package/template/skills/en/skill-to-workflow/SKILL.md +73 -1
- package/template/skills/en/task-decomposition/SKILL.md +1 -1
- package/template/skills/en/tree-processing/SKILL.md +1 -1
- package/template/skills/en/version-control/SKILL.md +15 -0
- package/template/skills/en/work-decomposition/SKILL.md +52 -32
- package/template/skills/phase_skills.yaml +5 -0
- package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
- package/template/skills/zh/bootstrap-workspace/SKILL.md +27 -0
- package/template/skills/zh/compliance-judgment/SKILL.md +51 -37
- package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
- package/template/skills/zh/confidence-system/SKILL.md +34 -9
- package/template/skills/zh/corner-case-management/SKILL.md +71 -104
- package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
- package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
- package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
- package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
- package/template/skills/zh/data-sensibility/SKILL.md +13 -0
- package/template/skills/zh/document-chunking/SKILL.md +101 -18
- package/template/skills/zh/document-parsing/SKILL.md +65 -65
- package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
- package/template/skills/zh/entity-extraction/SKILL.md +78 -68
- package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
- package/template/skills/zh/quality-control/SKILL.md +23 -0
- package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
- package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
- package/template/skills/zh/rule-extraction/SKILL.md +199 -188
- package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
- package/template/skills/zh/skill-authoring/SKILL.md +136 -58
- package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
- package/template/skills/zh/skill-creator/SKILL.md +215 -201
- package/template/skills/zh/skill-creator/references/schemas.md +60 -60
- package/template/skills/zh/skill-to-workflow/SKILL.md +73 -1
- package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
- package/template/skills/zh/task-decomposition/SKILL.md +1 -1
- package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
- package/template/skills/zh/tree-processing/SKILL.md +67 -63
- package/template/skills/zh/version-control/SKILL.md +15 -0
- package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
- package/template/skills/zh/work-decomposition/SKILL.md +52 -30
- package/template/workflows/common/llm_client.py +168 -0
- package/template/workflows/common/utils.py +132 -0
```diff
@@ -107,7 +107,7 @@ def generate_html(summary: dict, per_rule: dict, failed_cases: list[dict]) -> st
 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
-<title>KC
+<title>KC — Verification Dashboard</title>
 <style>
 :root {{ --bg: #1a1a2e; --surface: #16213e; --text: #e0e0e0; --accent: #4caf50; --warn: #ff9800; --err: #f44336; }}
 @media (prefers-color-scheme: light) {{
```
```diff
@@ -27,23 +27,17 @@ Do this for each new document type. Do it again when document sources change.
 
 After reading, answer these questions explicitly — write the answers down, not just think them:
 
-**What is consistent across all documents?**
-Header structure, field positions, terminology, date formats. These are your anchors. Design extraction around them.
+**What is consistent across all documents?** Header structure, field positions, terminology, date formats. These are your anchors. Design extraction around them.
 
-**What varies?**
-Table layouts, section ordering, field presence, formatting conventions. These are your risk points. Every variant needs a test case.
+**What varies?** Table layouts, section ordering, field presence, formatting conventions. These are your risk points. Every variant needs a test case.
 
-**What is surprising?**
-Anything you did not expect. A field that is sometimes missing. A value expressed in different units across documents. A section that appears in some templates but not others.
+**What is surprising?** Anything you did not expect. A field that is sometimes missing. A value expressed in different units across documents. A section that appears in some templates but not others.
 
-**Document subtypes?**
-Are there different templates, issuers, or time periods represented? A "loan contract" from Bank A may look nothing like one from Bank B. Identify subtypes early — they often need separate extraction paths.
+**Document subtypes?** Are there different templates, issuers, or time periods represented? A "loan contract" from Bank A may look nothing like one from Bank B. Identify subtypes early — they often need separate extraction paths.
 
-**Section lengths?**
-Measure them. A section that averages 200 tokens is fine for any model. A section that occasionally runs to 8,000 tokens will blow your context window budget. Plan accordingly.
+**Section lengths?** Measure them. A section that averages 200 tokens is fine for any model. A section that occasionally runs to 8,000 tokens will blow your context window budget. Plan accordingly.
 
-**Encoding issues?**
-Full-width vs half-width characters (１２．５％ vs 12.5%). Unicode normalization problems. OCR artifacts. These cause silent extraction failures because the text looks correct to human eyes but does not match regex patterns.
+**Encoding issues?** Full-width vs half-width characters (１２．５％ vs 12.5%). Unicode normalization problems. OCR artifacts. These cause silent extraction failures because the text looks correct to human eyes but does not match regex patterns.
 
 ## Spot-Check Protocol
 
```
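The encoding warning in that hunk is easy to reproduce. A minimal sketch (not package code) of why a full-width value slips past a half-width pattern, and how `unicodedata.normalize("NFKC", ...)` repairs it before extraction:

```python
# Minimal sketch: \d does match full-width digits in Python's re, but the
# literal '.' and '%' do not match their full-width counterparts — so the
# extraction miss is silent. NFKC normalization folds both forms together.
import re
import unicodedata

fullwidth = "利率：１２．５％"   # full-width digits and punctuation (common OCR output)
halfwidth = "利率: 12.5%"

pattern = re.compile(r"\d+\.\d+%")

print(pattern.search(fullwidth))                                  # None
print(pattern.search(unicodedata.normalize("NFKC", fullwidth)))   # matches '12.5%'
print(pattern.search(halfwidth))                                  # matches '12.5%'
```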
```diff
@@ -105,6 +99,19 @@ When something goes wrong — and it will — you can inspect each stage independen
 
 Keep intermediates for at least the current iteration. Delete old iterations only when disk space becomes a real constraint.
 
+## Looking at the corpus when it doesn't fit in your head
+
+A foundational constraint to plan around: you have a finite context window. Reading dozens of sample documents in a row will push earlier observations out of your working memory before you finish, leaving you with the impression of having seen the corpus but not the ability to actually generalize from it.
+
+Treat the corpus the way a statistician would treat a population: sample, summarize, and don't try to keep the population in your head. A few approaches that work in practice:
+
+- **Use the file system as memory.** Write a `notes/data_observations.md` (or per-rule `notes/<rule_id>_observations.md`) as you scan. Note field name variants, format quirks, missing-section patterns, surprising values. Re-read the notes file next session instead of re-scanning the docs.
+- **Per-rule notepads / memory.md.** For each rule, keep a short `memory.md` that captures "what I've seen across the sample set for this rule" — which documents trigger it, what values appear, what edge cases exist. Update incrementally rather than re-deriving it each time you look at the rule.
+- **Dispatch subagents to explore samples.** When the corpus is large, send a subagent (via the `agent_tool`) to scan a directory and return summary statistics or a short markdown report. The subagent's full reads stay in its own context; you receive only the digest. This is the right tool when you'd otherwise spend context budget reading dozens of files for a single observation.
+- **Statistical / meta views over individual reads.** Instead of reading 20 income certificates, run a regex over all of them and count format variants. Instead of opening every annual report, list filenames and group by issuer / year. Build the meta view first, then dive into representatives.
+
+The principle: aim for **enough samples to characterize the distribution**, not enough samples to memorize the corpus. The former fits in your head and in your notes. The latter doesn't.
+
 ## Integration
 
 Feed your observations into downstream skills:
```
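The "meta view" bullet lends itself to a short script. A hedged sketch of counting format variants across a sample directory — the paths and date patterns are illustrative assumptions, not package code:

```python
# Count date-format variants across a corpus instead of reading each file whole.
from collections import Counter
from pathlib import Path
import re

PATTERNS = {
    "iso":     re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "slash":   re.compile(r"\b\d{4}/\d{1,2}/\d{1,2}\b"),
    "chinese": re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),
}

counts = Counter()
for doc in Path("samples/").glob("**/*.txt"):   # hypothetical layout
    text = doc.read_text(encoding="utf-8", errors="replace")
    for name, pat in PATTERNS.items():
        if pat.search(text):
            counts[name] += 1                   # one vote per document, not per occurrence

print(counts.most_common())                     # e.g. [('chinese', 41), ('iso', 7), ('slash', 2)]
```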
```diff
@@ -2,32 +2,116 @@
 name: document-chunking
 tier: meta
 description: >
-
-
-
-
+  Split documents into chunks for downstream processing. Use when batching samples
+  for observation, feeding extraction workflows, or breaking long regulation documents
+  into pieces small enough to fit a worker LLM. Covers cheap methods (page, fixed-size,
+  header-based) for quick exploration AND the onion-peeler hierarchical strategy +
+  wedge fallback for production-grade chunking of long structured documents. Also
+  covers the central balance question: chunk-too-big (information lost in a haystack)
+  vs. chunk-too-small (semantic continuity broken).
 ---
 
 # Document Chunking
 
-Split documents into pieces for downstream processing.
+Split documents into pieces for downstream processing. Two regimes:
 
-
+- **Cheap chunking** — fast methods for batch observation and exploratory processing of samples.
+- **Hierarchical chunking** — the onion-peeler strategy (borrowed from pdf2skills' methodology) for long structured documents where semantic boundaries matter, with the wedge fallback for stretches that have no headers.
+
+The most important question across both regimes: **how big should a chunk be**? See "Finding the balance" below before settling on specific sizes.
+
+## Quick Methods
 
 **Page-level splits** — simplest. Each page is a chunk. Works for most document processing where you need to iterate over content.
 
-**Fixed-size chunks** — split by character
+**Fixed-size chunks** — split by character or token count with overlap. Good for search and initial observation. Typical: a few thousand chars with modest overlap to keep cross-boundary phrases recoverable.
+
+**Header-based splits** — detect section headers and split at boundaries. Preserves semantic units. Works when the document has a consistent header convention you can express as regex.
+
+## Onion Peeler — Hierarchical Strategy (primary for long structured docs)
+
+Hierarchical, header-based decomposition. Called "onion peeler" because you peel the document layer by layer, from the outermost structure inward.
+
+### How it works
+
+1. **Parse the document's heading hierarchy.** Identify all headers at every level (H1, H2, H3 — or the document's equivalent: "Part I", "Chapter 1", "Section 1.1", "Article 1").
+2. **Build a tree.** Each header is a node. Content between headers belongs to the nearest ancestor.
+3. **Check size.** Walk the tree. If a node's content (including all descendants) fits within the processing budget, stop there — that node is one chunk.
+4. **Descend only when needed.** If a node is over budget, descend into its children. Only split when the node is genuinely too large AND has sub-headers available.
+5. **Leaf nodes still over budget** → hand off to the wedge fallback.
+
+### Why it works
+
+- Respects the document's own semantic structure. "Chapter 3 — Risk Disclosure" stays as one chunk because that's how the author intended it.
+- Minimizes information loss. Never cuts mid-meaning.
+- Produces variable-size chunks — and that's a feature. A short chapter as one whole chunk is better than the same chapter forcibly split in half.
+
+### Shortcuts for pattern discovery
+
+Before building a full parser, explore structural patterns on a few sample documents:
+- Do all chapter headers start with "Chapter X" or "第X章"?
+- Is section numbering consistent (1.1, 1.2, 1.3)?
+- Are there visual markers (bold, specific font, horizontal rules)?
+
+If you find a stable pattern, a regex-based chunker is faster and more reliable than LLM-based structure detection. Examples:
+- `^第[一二三四五六七八九十百]+章` matches Chinese chapter headers
+- `^Chapter \d+` matches English chapter headers
+- `^\d+\.\d+` matches numbered subsections
+
+Validate the regex on multiple documents before relying on it.
+
+## Wedge Fallback (for content without clear headers)
+
+For dense legal text, continuous prose, or onion-peeler leaf nodes that are still too large with no sub-headers to descend into.
+
+### How it works
+
+Uses a **rolling context window** so the algorithm scales to documents of arbitrary length.
+
+1. **Window the content.** Load up to MAX_TOKENS of unprocessed text into a window (configurable; pick a size your LLM can comfortably read).
+2. **Have the LLM mark cut points.** Prompt the LLM to identify 1-3 natural breakpoints in the window where topic / subject shifts. For each cut point, the LLM returns:
+   - `tokens_before`: ~K tokens (e.g., K=50) preceding the cut, quoted verbatim from the source.
+   - `tokens_after`: ~K tokens following the cut, quoted verbatim.
+   - `chunk_title`: a short title (5-10 chars) for the chunk before the cut.
+3. **Locate cuts via fuzzy match.** The LLM's quoted tokens won't match the source exactly (minor rewording, whitespace differences). Use Levenshtein distance to find the best position. Require a reasonable similarity threshold; fall back to `tokens_before`-only matching if `tokens_after` can't be located.
+4. **Slide and repeat.** Cut the text before the first confirmed breakpoint as a chunk. Slide the window to start at the cut point. Repeat until the remaining text fits in a single chunk.
+
+### Why it works
+
+- LLM identifies semantic boundaries, not arbitrary character positions.
+- LLM doesn't regenerate text — it only quotes positions. No hallucination risk.
+- Token-quote + Levenshtein matching is language-agnostic: works on Chinese, English, mixed-language docs.
+- Rolling window scales to any document length.
+- Fuzzy matching handles inevitable small differences between LLM-quoted text and source.
+
+### When to use it
+
+- Only when onion-peeler can't proceed (no sub-headers available).
+- For unstructured documents with no formal markers.
+- Cost-aware: this method calls the LLM. Pick the cheapest model that can identify topic boundaries (typically tier 3 or 4 is enough).
+
+## Finding the balance — when to stop splitting
+
+The two failure modes:
+
+- **Chunks too big**: relevant content gets buried in a haystack inside the LLM's context. Even within the LLM's window, attention spreads thin across long inputs — the longer the chunk, the more likely the actual evidence is missed.
+- **Chunks too small**: semantic continuity breaks. A rule that needs "the company is a bank" + "the loan exceeds threshold X" to fire might see those facts split across chunks and lose the conjunction.
+
+How to find the balance:
 
-
+1. **Anchor on the downstream task, not the LLM's context window.** The chunk should be large enough to contain the evidence a downstream rule needs in one piece. If a rule needs to compare two clauses, those clauses must end up in the same chunk.
+2. **Use semantic boundaries over fixed sizes.** A chunk that ends at a section boundary is more useful than a chunk that hit a target token count mid-sentence. Onion-peeler stops where the document stops; lean on that.
+3. **Test with the actual downstream consumer.** Run a sample extraction or judgment on the chunked output. If the consumer misses evidence that's present in the source, your chunks are the wrong shape — usually too big or split at the wrong boundary.
+4. **Track variance, not just average size.** A handful of giant chunks among many small ones is more of a problem than a uniform distribution at any reasonable size. The big ones are where you'd lose information.
+5. **Don't optimize blindly for the LLM's context window.** A 128K context model can technically swallow a 100K chunk; the attention to retrieve specific evidence from that chunk is a different question. Smaller, well-bounded chunks usually win.
 
-##
+## Practical Tips
 
-
--
--
--
-- Table of contents available → parse TOC for structure
+- **Chunk size depends on the downstream task.** Rule extraction by the coding agent can take very large chunks. Worker LLM verification needs chunks that comfortably fit inside its context with room for prompt + response.
+- **Preserve context.** When splitting, carry the parent header chain as context. A chunk from "Part II > Chapter 3 > Section 3.2" should include those headers so the downstream consumer knows where it sits.
+- **Cache the chunk tree.** Once a document's structure is parsed, save the tree. Many rules may need the same document's content; re-parsing is waste.
+- **Log chunking decisions.** Which strategy was used, how many chunks were produced, what the size distribution looks like. Helpful for downstream debugging.
 
 ## Relationship to tree-processing
 
-This skill
+This skill covers chunking methods. `tree-processing` covers designing the precise, coded chunking script for production verification workflows — where chunking must be deterministic, reproducible, and tested. Reach for `tree-processing` when the cheap methods above don't give you enough control for the production path.
```
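The onion-peeler walk (steps 3-4 of the hunk above) is compact in code. A minimal sketch under assumed shapes — the `Node` structure and the character budget are illustrations, not the package's implementation:

```python
# Sketch of the descend-only-when-needed walk: a node that fits the budget is
# emitted whole; an oversized node is split along its own sub-headers.
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    text: str = ""                      # content directly under this header
    children: list["Node"] = field(default_factory=list)

    def size(self) -> int:             # total chars including all descendants
        return len(self.text) + sum(c.size() for c in self.children)

def peel(node: Node, budget: int, chunks: list) -> None:
    """Emit `node` as one chunk if it fits; otherwise descend into children."""
    if node.size() <= budget or not node.children:
        # Fits, or a leaf with no sub-headers left. A leaf that is still over
        # budget is where the wedge fallback would take over.
        chunks.append(node)
        return
    if node.text:                       # preamble before the first sub-header
        chunks.append(Node(node.title + " (intro)", node.text))
    for child in node.children:
        peel(child, budget, chunks)

chunks: list[Node] = []
doc = Node("Regulation", children=[Node("Chapter 1", "..."), Node("Chapter 2", "...")])
peel(doc, budget=8000, chunks=chunks)
```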
```diff
@@ -38,11 +38,9 @@ Extraction method selection is a cost-accuracy search. The goal is finding the c
 
 ### Available Methods
 
-**Regex / Python** — Cost: zero. Speed: instant. Deterministic.
-Works well for: dates, monetary amounts, percentages, identifiers, fixed phrases, any value with a predictable format.
+**Regex / Python** — Cost: zero. Speed: instant. Deterministic. Works well for: dates, monetary amounts, percentages, identifiers, fixed phrases, any value with a predictable format.
 
-**Worker LLM** — Cost: API tokens. Speed: seconds. Semantic understanding.
-Works well for: contextual interpretation, conditional values, semantic matching, ambiguous structures, suggestive or misleading language detection, table interpretation, anything requiring understanding rather than pattern matching.
+**Worker LLM** — Cost: API tokens. Speed: seconds. Semantic understanding. Works well for: contextual interpretation, conditional values, semantic matching, ambiguous structures, suggestive or misleading language detection, table interpretation, anything requiring understanding rather than pattern matching.
 
 Many real verification tasks require semantic understanding — "is this description misleading?", "does this clause adequately disclose risk?", "is this guarantor's business description consistent with their stated industry?" — regex cannot handle these. Use worker LLM without hesitation for such tasks.
 
```
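The method-selection logic above can be sketched as a two-tier router: try the zero-cost deterministic extractor first, escalate to a worker LLM only when it fails. Everything here is illustrative — the field, the pattern, and the `call_worker_llm` stub are hypothetical stand-ins, not the package's API:

```python
import re

# Tier 1 pattern for a predictable-format field (invented for illustration).
AMOUNT = re.compile(r"(?:贷款金额|loan amount)[：:]?\s*([\d,，.]+)", re.IGNORECASE)

def call_worker_llm(prompt: str) -> str:
    """Hypothetical stand-in — wire up the workflow's actual LLM client here."""
    raise NotImplementedError

def extract_loan_amount(text: str) -> dict:
    m = AMOUNT.search(text)
    if m:
        return {"value": m.group(1), "method": "regex"}
    # The value may be phrased contextually ("an amount not to exceed..."),
    # which needs semantic reading — escalate without hesitation.
    answer = call_worker_llm(
        "Quote the loan amount stated in this text, or reply NONE:\n" + text[:4000]
    )
    return {"value": answer, "method": "worker_llm"}
```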
```diff
@@ -119,3 +117,15 @@ When designing extraction for worker LLM workflows:
 3. If the section exceeds available context, narrow further via tree processing.
 4. Always leave room for the model's response.
 5. Test with the actual model to verify the context fits — token counts from the coding agent may differ from the worker LLM's tokenizer.
+
+## Extraction has corner cases too
+
+Extraction is **as important as judgment** for final accuracy. A common observation across projects: more than half of the final errors trace back to extraction problems, not judgment — the extractor returned the wrong value, the wrong unit, or pulled from the wrong section, and the judge faithfully concluded the wrong verdict from the wrong input.
+
+Treat extraction with the same iteration discipline as judgment:
+
+- **Reflection / iteration**: after running an extractor on the sample set, look at the cases where it failed. Is the failure a missing pattern (add to the prompt or regex)? A format quirk (unit conversion, locale)? A document-class issue (extractor right for class A but wrong for class B)?
+- **Corner-case registration**: when an extraction failure can't be fixed without disproportionate cost to the standard extractor, log it as a corner case in `corner-case-management` — same registry shape as a judgment corner case, just with the resolution typed as a `code` / `prompt` / `parser`-class transformation.
+- **Validate the extractor independently of the judge**: an end-to-end test that fails only on the judgment side may hide a bad extractor whose outputs happen to verdict correctly *most* of the time. Use QC review to spot-check extracted values, not just final verdicts.
+
+When you're tempted to fix accuracy by tuning the judge's prompt, first check whether the extractor is giving the judge the right input. The cheaper, more durable fix is almost always in the extractor.
```
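"Validate the extractor independently of the judge" amounts to a small audit loop. A sketch, assuming a hand-labeled gold file and an extractor callable — both shapes are hypothetical:

```python
import json
from pathlib import Path

def audit_extractor(extract, gold_path: str = "qc/gold_extractions.json") -> list:
    """Compare extractor output against hand-labeled gold values, field by field.

    Assumed shapes: extract(doc_id) -> {field: value}; the gold file maps
    doc_id -> {field: expected_value}.
    """
    gold = json.loads(Path(gold_path).read_text(encoding="utf-8"))
    mismatches = []
    for doc_id, expected in gold.items():
        actual = extract(doc_id)
        for fname, want in expected.items():
            if actual.get(fname) != want:
                mismatches.append((doc_id, fname, want, actual.get(fname)))
    for doc_id, fname, want, got in mismatches[:20]:
        print(f"{doc_id}.{fname}: expected {want!r}, got {got!r}")
    print(f"{len(mismatches)} field mismatches total")
    return mismatches
```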
```diff
@@ -8,6 +8,20 @@ description: Design and execute quality control for production verification work
 
 Quality control is the Observer role. You are watching the worker LLMs perform and deciding whether they are doing it well enough. The goal is not to review every result — that would defeat the purpose of automation. The goal is to review just enough to maintain confidence that the system is working.
 
+## How this skill cooperates with the others
+
+Quality control is one part of a tightly-cooperating set of skills. Don't replicate content from a sibling skill here — point to it. Skills loaded together in the same phase are already accessible to the conductor; re-injecting their material into this skill just bloats both.
+
+The relationships:
+
+- `confidence-system` defines how confidence is composed and calibrated. When QC uses confidence to triage which results need more review, it consumes confidence — but the design of confidence belongs there.
+- `evolution-loop` is the closed-loop machinery for turning QC findings into improvements. QC produces signals (failures, drift, recurring patterns); evolution-loop decides how to act on them.
+- `corner-case-management` is where exceptions discovered by QC live. QC surfaces "this one didn't fit"; corner-case-management decides whether it's a corner case to register, a systemic problem to promote to mainline, or a data-quality issue to escalate.
+- `cross-document-verification` is its own check class. QC's job is to verify those rules are running as designed, not to re-explain how to build them.
+- `dashboard-reporting` is where QC results surface to the developer user. QC produces the data; the dashboard renders it.
+
+Practical implication for authoring: if you find yourself writing in this file something that more naturally belongs to one of the skills above, write a one-sentence pointer here ("see `confidence-system` for how confidence is composed") and leave the depth in the right place. The conductor will have the other skill loaded when it needs the detail.
+
 ## Five-Layer QA Architecture
 
 Quality control is not one activity — it is five layers that build on each other. Lower layers must pass before higher layers run.
```
```diff
@@ -121,6 +135,15 @@ There are two distinct dashboards in this system:
 
 When a release is built, point end users at the bundled dashboard, not the workspace one. Workspace dashboard stays your developer surface.
 
+## Re-release after substantive changes
+
+A release bundle is a snapshot of `workflows/` and `rule_skills/` at the moment the `release` tool ran. If you modify any `workflows/<rule>/workflow_v*.py`, `rule_skills/<id>/SKILL.md`, or `check.py` AFTER the release was built, the shipped artifact no longer reflects your actual work. The engine's milestone derivation will surface `releaseIsStale: true` with the divergent file list.
+
+When this fires:
+- **Substantive change** (new hybrid path, fixed verdict logic, added rule): re-run the `release` tool to produce a fresh bundle.
+- **Cosmetic edit only** (typo, comment, formatting): write `.accept_stale_release` into the release directory to acknowledge it — `touch output/releases/<slug>/.accept_stale_release`.
+- **DON'T** declare finalization done while a stale release ships. Downstream consumers (other agents, deployed verification systems) read the bundled `parser_v*.py` / `workflows/`, not the workspace.
+
 ## Developer User Involvement
 
 The developer user should see QC results through the dashboard (see `dashboard-reporting`). Key metrics to surface:
```
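The staleness check that hunk describes boils down to hash-comparing workspace sources against the shipped bundle. An illustrative sketch only — the engine derives `releaseIsStale` itself, and the paths and glob patterns here are assumptions:

```python
import hashlib
from pathlib import Path

def digest(p: Path) -> str:
    return hashlib.sha256(p.read_bytes()).hexdigest()

def divergent_files(workspace: Path, bundle: Path,
                    patterns=("workflows/**/*.py", "rule_skills/**/SKILL.md")):
    """List workspace files that differ from (or are missing in) the bundle."""
    out = []
    for pattern in patterns:
        for src in workspace.glob(pattern):
            shipped = bundle / src.relative_to(workspace)
            if not shipped.exists() or digest(shipped) != digest(src):
                out.append(src.relative_to(workspace))
    return out

stale = divergent_files(Path("."), Path("output/releases/my-release"))  # slug invented
print("releaseIsStale:", bool(stale), stale)
```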
```diff
@@ -16,6 +16,12 @@ Data/entity extraction (`entity-extraction`) is the **repeating task** that runs
 
 Don't conflate the two. Rule extraction happens once; data extraction happens on every document.
 
+## Source-first sequencing
+
+Extract rules from the source text FIRST. Only after you have a complete first-pass catalog from sources alone should you open sample documents. The temptation is to peek at samples early to "see what kinds of rules matter" — this biases you toward rules the samples happen to exercise and silently drops rules the samples don't cover.
+
+A domain professional reads the source material, builds an understanding, then validates on samples — not the reverse. KC's differentiator over general-purpose agents is systematic accuracy across long context; that advantage compounds when you ground in the SOURCE, not the EXAMPLES.
+
 ## Rule Structure: Location → Extraction → Judgment
 
 Every verification rule decomposes into three parts:
```
```diff
@@ -62,22 +68,17 @@ When rules change (additions, modifications, deprecations), version the entire r
 
 ## Granularity Calibration (read before extracting)
 
-
-
-
-
-
-
-If your first pass
-
-
-
-
-- **Drop procedural language** that isn't checkable against a report
-(definitions, scope statements, references to other regs that just
-transitively apply).
-- **Keep only checkable obligations, prohibitions, and thresholds** —
-things where you can read a sample report and say pass or fail.
+Rule catalogs come from diverse source materials — formal regulations, internal handbooks, case law, legal opinions, expert rule tables, regulator Q&A. There is no universal "right number of rules per page". Calibrate by logic, not by count:
+
+- **Atomicity is the real test.** A rule that can produce two independent pass/fail outcomes is two rules. A rule whose verdict requires verifying three different paragraphs of the source is probably three rules.
+- **Boilerplate is not a rule.** Definitions, scope statements, transitive references to other regulations, and procedural language that can't be checked against the target document do not become rules.
+- **Keep only checkable obligations, prohibitions, and thresholds** — things where you can read a target document and say pass / fail / not-applicable.
+
+If your first pass feels too coarse (one rule per chapter, ignoring multiple distinct obligations within) — go finer. If it feels too fine (every clause in a definitions section is its own rule) — merge or drop. Then:
+
+- **Merge rules that share evidence and fail together** (e.g., "must disclose X" and "must disclose Y" where both come from the same required-fields table → one rule: "must disclose the required-fields list including X, Y").
+- **Drop procedural language** that isn't checkable against a target document.
+- **Convert each surviving rule into a falsifiability statement** — if you can't state precisely what would make it fail, you don't have a rule yet.
 
 ### Sample "good" rule
 
```
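The atomicity test reads well as a worked example. A hypothetical compound rule split into two atomic rules, loosely following the catalog fields the skill names (`description`, `falsifiability_statement`) — the IDs and wording are invented for illustration:

```json
{
  "compound": "R020: must disclose management fee rate AND custody fee rate",
  "split_into": [
    {
      "id": "R020a",
      "description": "Prospectus must disclose the management fee rate",
      "falsifiability_statement": "FAIL if no management fee rate appears in the fee section"
    },
    {
      "id": "R020b",
      "description": "Prospectus must disclose the custody fee rate",
      "falsifiability_statement": "FAIL if no custody fee rate appears in the fee section"
    }
  ]
}
```

Each half now produces exactly one pass/fail outcome, which is what lets skill-authoring write a check script per rule.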
````diff
@@ -94,104 +95,58 @@ If your first pass produces more than ~25 rules for a single regulation:
 }
 ```
 
-Note: one pass/fail outcome, a single `source_ref` to a specific clause,
-clear applicability scope. Skill-authoring can write `check_r014.py` from
-this alone.
+Note: one pass/fail outcome, a single `source_ref` to a specific clause, clear applicability scope. Skill-authoring can write `check_r014.py` from this alone.
 
-### Cross-
+### Cross-source dedup (when working across multiple documents)
 
-If the developer user provides N
-duplicate cross-cutting requirements already captured by earlier ones
-(e.g., a 2018 generic disclosure rule vs. a 2025 specific version).
-Before emitting a rule from reg N:
+If the developer user provides N source documents, rules from later sources often duplicate cross-cutting requirements already captured by earlier ones (e.g., a generic disclosure rule from an older regulation vs. a newer specific version of the same obligation). Before emitting a rule from source N:
 
-1. **Check the existing catalog.** Use `rule_catalog` (operation: list)
-to see what's already there. Skip if a rule with equivalent scope +
-intent exists.
+1. **Check the existing catalog.** Use `rule_catalog` (operation: list) to see what's already there. Skip if a rule with equivalent scope + intent exists.
 2. **Prefer the newer / more specific source_ref** when rules overlap.
-3. **If you merged rules**, record the consolidated sources in
-`source_ref`: e.g., `"New Reg §15.2 + Old Reg §24"`.
+3. **If you merged rules**, record the consolidated sources in `source_ref`: e.g., `"New Reg §15.2 + Old Reg §24"`.
 
 ### Delegation to sub-agents
 
-If you dispatch extraction to sub-agents (one per
-
-
-
-- **
-
-
-
-
-"the core regs" as a pronoun — LLMs composing long structured briefs
-frequently drop items (observed in session 6304673afaa0 where reg 02
-was silently omitted).
-- **State the dedup contract**: "Rules already in the parent's catalog
-(R001–Rnnn) should NOT be re-extracted. If a requirement is already
-covered, skip it." Then pass the current catalog's ID ranges.
-- **Prefer `rule_catalog` create operations over sandbox_exec writes to
-catalog.json.** rule_catalog uses workspace file locking;
-sandbox_exec bypasses it and races with other writers.
-
-## How to read regulation files (default: read whole)
-
-Regulations are the audit's authoritative basis. Every `source_ref`
-in your extracted rules must be verifiable against the source text.
-For typical regulation documents (a single file under ~50 KB / under
-~100 pages), **read each regulation file whole using `workspace_file`
-(operation=read) in a single call**:
+If you dispatch extraction to sub-agents (one per source document), the sub-agent inherits ONLY its `task_description` — it cannot see your conversation or existing catalog. Therefore, when composing the brief:
+
+- **Anchor calibration with a concrete sample rule.** Paste the JSON above verbatim into the brief body so the sub-agent's atomicity calibration matches yours.
+- **Name every source document the sub-agent should process.** If AGENT.md lists 10 core source documents, the brief must list all 10 by name, not "the core regs" as a pronoun — LLMs composing long structured briefs frequently drop items silently.
+- **State the dedup contract**: "Rules already in the parent's catalog (R001–Rnnn) should NOT be re-extracted. If a requirement is already covered, skip it." Then pass the current catalog's ID ranges.
+- **Prefer `rule_catalog` create operations over sandbox_exec writes to catalog.json.** rule_catalog uses workspace file locking; sandbox_exec bypasses it and races with other writers.
+
+## How to read source files (default: read whole)
+
+Source documents are the catalog's authoritative basis. Every `source_ref` in your extracted rules must be verifiable against the source text. For typical source documents (a single file under ~50 KB / under ~100 pages), **read each source file whole using `workspace_file` (operation=read) in a single call**:
 
 ```js
-workspace_file({ operation: "read", scope: "project", path: "Rules/
+workspace_file({ operation: "read", scope: "project", path: "Rules/01_some_source.md" })
 ```
 
-`workspace_file.read` is capped at 50,000 chars per call, which
-covers virtually every individual regulation document. This is the
-default. **Read every regulation file whole before you start
-extracting rules from any of them.**
+`workspace_file.read` is capped at 50,000 chars per call, which covers virtually every individual source document. This is the default. **Read every source file whole before you start extracting rules from any of them.**
 
 ### Tool choice — `workspace_file` vs `sandbox_exec`
 
 | Tool | Per-call cap | Use for |
 |---|---:|---|
-| `workspace_file` (read) | 50,000 chars | **full reads of
+| `workspace_file` (read) | 50,000 chars | **full reads of source / rule documents** |
 | `sandbox_exec` (cat/head/etc) | 10,000 chars | shell commands, **not** full file reads |
 
-`sandbox_exec` is designed for shell commands; its 10K cap is too
-small for most regulations. `cat rules/01_*.md` returns only the
-first ~10 KB followed by `\n[truncated]`. Re-issuing with `head -N` /
-`tail -M` to scroll the window loses positional precision and burns
-turns. **When you see truncation, don't fight the cap — switch
-tools.**
+`sandbox_exec` is designed for shell commands; its 10K cap is too small for most regulations. `cat rules/01_*.md` returns only the first ~10 KB followed by `\n[truncated]`. Re-issuing with `head -N` / `tail -M` to scroll the window loses positional precision and burns turns. **When you see truncation, don't fight the cap — switch tools.**
 
-### Asymmetry —
+### Asymmetry — sources read whole, samples sampled
 
-
-read once. Read every regulation whole.
+Source documents are limited (typically 1-10 files), authoritative, and read once. Read every source file whole.
 
-Sample documents may number 30 to 1000+, are heterogeneous, and get
-read many times during testing. **Don't try to read every sample
-whole.** Use rule-applicability filters or sampled subsets to focus
-attention.
+Sample documents may number 30 to 1000+, are heterogeneous, and get read many times during testing. **Don't try to read every sample whole.** Use rule-applicability filters or sampled subsets to focus attention.
 
-### Escape valve — when a single
+### Escape valve — when a single source exceeds ~200K chars
 
-Rare in practice
-typical Chinese banking regs (资管新规, 信披办法, etc.) all fit
-under 50 KB. But if you do encounter a single regulation so large
-that reading it whole would crowd the context window — heuristic:
-the file exceeds ~200,000 chars or ~25% of your context budget —
-use your own judgment:
+Rare in practice — most regulation, handbook, or rule-table documents fit comfortably under 50 KB. But if you do encounter a single source document so large that reading it whole would crowd the context window — heuristic: the file exceeds ~200,000 chars or ~25% of your context budget — use your own judgment:
 
-- Read by chapter (e.g., `第X章` / `Chapter X`) using `document_parse`
-
-- Or build an in-workspace index file pointing to chapter offsets and
-read on-demand per rule being extracted
+- Read by chapter (e.g., `第X章` / `Chapter X`) using `document_parse` or paginated `workspace_file` reads
+- Or build an in-workspace index file pointing to chapter offsets and read on-demand per rule being extracted
 
-The 50 KB cap is high enough that this almost never triggers. **The
-default is read whole; deviate only when the file genuinely doesn't
-fit.**
+The 50 KB cap is high enough that this almost never triggers. **The default is read whole; deviate only when the file genuinely doesn't fit.**
 
 ## Extraction Strategies
 
````
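The "in-workspace index file" escape valve is a few lines of code. A sketch assuming a Chinese-style `第X章` header convention and invented paths — adjust the pattern for other conventions:

```python
# Build a chapter-offset index once, then read single chapters on demand by
# slicing the text between adjacent offsets instead of re-reading the file.
import json
import re
from pathlib import Path

CHAPTER = re.compile(r"^第[一二三四五六七八九十百]+章.*$", re.M)

text = Path("Rules/05_big_regulation.md").read_text(encoding="utf-8")  # hypothetical
index = [{"title": m.group(0).strip(), "offset": m.start()} for m in CHAPTER.finditer(text)]

Path("rules/05_chapter_index.json").write_text(
    json.dumps(index, ensure_ascii=False, indent=2), encoding="utf-8"
)
```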
```diff
@@ -202,11 +157,14 @@ When the developer user provides rules in xlsx, csv, or a structured document wh
 - Map each row to a rule, preserving the developer user's identifiers.
 - Ask clarifying questions only if entries are ambiguous.
 
-### Strategy 2: Hierarchical Extraction from
+### Strategy 2: Hierarchical Extraction from Source Text
 
-For raw
+For raw source documents (PDF, DOCX, legal text, handbooks, case collections):
 
 1. **Survey the document structure.** Read the table of contents or scan headers. Understand the hierarchy: parts, chapters, sections, articles, clauses.
+
+   Before extracting any rule, traverse the table of contents and section headers end-to-end. Sketch the rule-bearing hierarchy: which chapters impose obligations, which are definitions / context. A common failure mode: a long source with many articles yields disproportionately few rules — almost always meaning you stopped surveying after the high-density chapters. Decide your rule-bearing chapter span explicitly, then justify deviations relative to that span rather than to a single global count target.
+
 2. **Identify rule-bearing sections.** Not every section contains a verification rule. Some are definitions, some are procedural, some are context. Focus on sections that impose obligations, prohibitions, thresholds, or requirements.
 3. **Peel the onion.** Start at the highest structural level and work downward:
    - Level 1: What major areas does the regulation cover? (e.g., capital adequacy, risk disclosure, governance)
```
```diff
@@ -216,7 +174,7 @@ For raw regulation documents (PDF, DOCX, legal text):
 4. **Handle cross-references.** Regulations love to say "as defined in Section X" or "subject to the conditions in Article Y." Resolve these by including the referenced content in the rule's description, not just the reference.
 5. **Handle compound rules.** "The report must include (a) risk factors, (b) financial projections, and (c) management discussion" — this is three rules, not one. Decompose unless the developer user specifically wants them grouped.
 
-For long documents
+For long documents, use the onion-peeler approach — see the `document-chunking` skill for the full strategy and the wedge-driving fallback for sections without clear headers. Do not try to read the entire document in one pass.
 
 ### Strategy 3: Expert Notes
 
```
```diff
@@ -285,6 +243,8 @@ Do not skip ambiguous rules. They are often the most important ones.
 
 ## Sanity-check applicability against the sample corpus
 
+> This is a validation pass, not a discovery pass. Do not let 0-sample rules tempt you to delete them at this stage — first ask whether the source requires them; if yes, keep them as "future scope" rather than dropping them.
+
 After extracting your rule catalog and before authoring skills, do this 5-minute check: project each rule's applicability filter against the sample corpus.
 
 For every rule:
```
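The 5-minute projection the hunk describes can be scripted. A hedged sketch — the catalog path and the applicability predicate are placeholders for whatever shape your catalog actually uses:

```python
import json
from pathlib import Path

catalog = json.loads(Path("rules/catalog.json").read_text())   # hypothetical path
samples = list(Path("samples/").glob("**/*.txt"))

def applies(rule: dict, text: str) -> bool:
    # Simplest possible stand-in: the rule lists keywords that scope it.
    return any(k in text for k in rule.get("applicability_keywords", []))

for rule in catalog["rules"]:
    n = sum(applies(rule, p.read_text(errors="replace")) for p in samples)
    flag = "  <-- 0 samples: future-scope or over-constrained?" if n == 0 else ""
    print(f"{rule['id']}: {n}/{len(samples)}{flag}")
```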
```diff
@@ -292,14 +252,52 @@ For every rule:
 2. For each rule, count how many samples it would apply to (per the rule's `applicability` field, scope filter, or whatever shape your catalog uses)
 3. Flag rules that apply to **0 samples** — they're either genuinely test-corpus-irrelevant (acceptable) or over-constrained (bug)
 
-
+A failure mode worth flagging: a catalog where a large fraction of rules (say 30-40%) return `PASS=0 FAIL=0 NOT_APPLICABLE=all` across the entire sample set. Some inactive rules are legitimate (the source requires checks for a product type the corpus doesn't happen to contain), but a high inactive ratio almost always signals scope-too-narrow drift — applicability filters that over-specify.
 
 If many rules are 0-sample, either:
 - **Reframe their applicability** — broaden product types, look for evidence in headers/footers not just body, relax the scope filter
 - **Document them as "future scope"** and remove from this iteration's catalog (still capture them in a `rules/future_scope.md` so they're not forgotten)
 - **Update the test corpus** to include matching samples (work with the developer user)
 
-Catching this in `rule_extraction` is much cheaper than authoring
+Catching this in `rule_extraction` is much cheaper than authoring N skills that then test as inactive in `skill_testing`. The cheap projection here is worth the time it saves later.
+
+## Logic-type taxonomy (coverage diagnostic)
+
+After first-pass extraction, classify each rule by judgment type:
+
+- **Threshold** — numeric comparison ("annualized rate ≥ 15.4%")
+- **Decision-Tree** — multi-branch ("if product type ∈ {A, B} then ...")
+- **Heuristic** — semantic judgment ("does marketing copy imply principal guarantee")
+- **Process** — procedural compliance ("published within the required deadline")
+
+If your catalog is 90% Threshold rules, you have likely missed the semantic / process obligations that don't reduce to a number. Re-survey for those. The four types are roughly comparable in frequency across most rule corpora; a heavy skew is a signal to look again at the chapters or sections you skimmed.
+
+## Preserve specifics (anti-summarize)
+
+When writing a rule's `description` and `falsifiability_statement`, preserve every threshold, percentage, deadline, and named entity from the source. "Disclose within a reasonable period" is a vague rule and will fail downstream — the source almost certainly says "within 15 business days." If the source IS genuinely vague, flag the ambiguity explicitly (e.g., `notes: "source uses '及时'; no numeric deadline"`) rather than smoothing it over. Downstream skill-authoring will need the specifics to write check.py logic.
+
+## Soft sample-access discipline
+
+You have unlimited tool access to samples — KC does not cap you. The discipline is procedural: source-extraction phase first, then validation phase. Inside the source-extraction phase, samples are a last-resort reference for clarifying terminology, not a discovery surface. If you find yourself opening sample No. 3 to figure out what to extract next, you have inverted the methodology — close the sample, return to the source. Acceptable narrow exceptions:
+- A jargon term in the source needs example resolution
+- Sanity-checking that a rule's `description` field reads coherently when applied to a real document
+
+## Primary vs auxiliary sources — iteration order, NOT coverage breadth
+
+When the developer user labels some source documents "primary" and others "auxiliary" (or "supplementary", or "secondary"), that distinction is about **iteration order**: do the primary regs deeply first, then come back to the auxiliary ones. It is **NOT** a license to skip the auxiliary regs entirely.
+
+A recurring failure mode worth flagging: the agent reads "primary 01-02 are the main basis, the rest is auxiliary" and produces 13 rules from regs 01-02 + 2 rules from regs 03-04 + zero rules from regs 05-10. The auxiliary regulations (often 60-90 articles each in compliance domains) almost always contain core obligations the primary regs reference or assume. Extracting nothing from them produces a thin catalog that misses real compliance requirements.
+
+The right interpretation: the primary regs get the first deep pass; the auxiliary regs get a structural-survey pass at minimum — identify their core obligations and extract those, even if not at the same density as the primary. Skipping an 80-article regulation entirely should require an explicit reason in `coverage_audit.md` (e.g., "regulation 05 covers fund operations outside our case scope; explicitly out-of-scope per user discussion"). Silent skipping is the failure mode.
+
+## Coverage trace (recommended deliverable)
+
+After extraction, walk the source document paragraph-by-paragraph and tag each as either:
+
+- `covered_by: [Rxxx, Ryyy]` — articles whose obligations became one or more rules
+- `non_checkable: definition | context | cross_ref | scope` — articles excluded with explicit reason
+
+Write this as `rules/coverage_trace.md` (or a section in `coverage_audit.md`). This is the source-side mirror of the existing sample-side applicability check, and it catches the "long source → suspiciously few rules" failure mode directly. Engine derivation can read this trace to validate completeness later.
 
 ## When Rules Change
 
```
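The skew diagnostic from the taxonomy section above is a one-screen script. A sketch assuming the catalog stores a `logic_type` field per rule — that field name is an assumption; adapt it to your schema:

```python
# Tally judgment types across the catalog and warn when one type dominates,
# which per the skill text usually means semantic/process rules were missed.
import json
from collections import Counter
from pathlib import Path

catalog = json.loads(Path("rules/catalog.json").read_text())   # hypothetical path
types = Counter(r.get("logic_type", "unclassified") for r in catalog["rules"])
total = sum(types.values())

for t, n in types.most_common():
    print(f"{t:15s} {n:4d}  ({n / total:.0%})")

dominant, n = types.most_common(1)[0]
if n / total > 0.7:
    print(f"WARNING: {dominant} is {n / total:.0%} of the catalog — "
          f"re-survey for the other logic types.")
```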